Abstract

No-regret algorithms are a popular class of online learning rules. Unfortunately, most
no-regret algorithms assume that the set $\mathcal{Y}$ of allowable hypotheses is small and
discrete. We consider instead prediction problems where $\mathcal{Y}$ has internal structure:
$\mathcal{Y}$ might be the set of strategies in a game like poker, the set of paths in a graph, or the
set of configurations of a data structure like a rebalancing binary search tree; or $\mathcal{Y}$
might be a given convex set (the “online convex programming” problem) or in general
an arbitrary bounded set. We derive a family of no-regret learning rules, called
Lagrangian Hedging algorithms, to take advantage of this structure. Our algorithms are
a direct generalization of known no-regret learning rules like weighted majority and
external-regret matching. In addition to proving regret bounds, we demonstrate one
of our algorithms learning to play one-card poker.
1  Introduction


In a sequence of trials we are required to pick hypotheses $y_1, y_2, \ldots \in \mathcal{Y}$. After we choose
$y_t$, the correct answer is revealed in the form of a convex expected-loss function $\ell_t(y_t)$.¹
Just before seeing the $t$th example, our total loss is therefore

$$L_t = \sum_{i=1}^{t-1} \ell_i(y_i)$$

If we had predicted using some fixed hypothesis $y$ instead, then our loss would have been
$\sum_{i=1}^{t-1} \ell_i(y)$. We say that our total regret at time $t$ for not having used $y$ is the difference
between these two losses:

$$\rho_t(y) = L_t - \sum_{i=1}^{t-1} \ell_i(y)$$

Positive regret means that the loss for $y$ is smaller than our actual loss; that is, we would
rather have used $y$. Our overall regret is our regret for not having used the best hypothesis
$y \in \mathcal{Y}$:

$$\rho_t = \sup_{y \in \mathcal{Y}} \rho_t(y)$$
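As a minimal illustration of these definitions, the following Python sketch computes $L_t$, $\rho_t(y)$, and $\rho_t$ for a toy problem. It assumes linear losses $\ell_i(y) = c_i \cdot y$ and replaces the supremum over $\mathcal{Y}$ by a maximum over a finite list of candidate hypotheses; the function names and the data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def total_loss(plays: List[Vector], costs: List[Vector]) -> float:
    """L_t: the loss we actually incurred on the trials seen so far."""
    return sum(dot(y, c) for y, c in zip(plays, costs))


def regret_against(y: Vector, plays: List[Vector], costs: List[Vector]) -> float:
    """rho_t(y) = L_t - sum_i l_i(y) for one fixed comparison hypothesis y."""
    return total_loss(plays, costs) - sum(dot(y, c) for c in costs)


def overall_regret(candidates: List[Vector], plays: List[Vector],
                   costs: List[Vector]) -> float:
    """rho_t, with the sup over Y replaced by a max over a finite list."""
    return max(regret_against(y, plays, costs) for y in candidates)


# Toy data (assumed): two pure actions, and we always played the first one.
plays = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
costs = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
corners = [(1.0, 0.0), (0.0, 1.0)]
```

Because a linear function on a polytope attains its maximum at a corner, checking the corners is enough in this toy setting.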

No-regret  algorithms  are  a  popular  class  of  learning  rules  which  always  have  small  regret  no 
matter  what  sequence  of  examples  they  see.  This  no-regret  property  is  a  strong  guarantee: 
it holds for all comparison hypotheses $y \in \mathcal{Y}$, even though we are choosing which $y$ to
compare ourselves to after seeing $\ell_t$ for all $t$. And, it holds even if the loss functions $\ell_t$ are
statistically  dependent  from  trial  to  trial;  such  dependence  could  result  from  unmeasured 
covariates,  or  from  the  action  of  an  external  agent. 

Unfortunately, many no-regret algorithms assume that the predictions $y_t$ are probability
distributions over a small, discrete set. This assumption limits their applicability:
in  many  interesting  prediction  problems  (such  as  finding  the  best  pruning  of  a  decision 
tree,  playing  poker,  balancing  an  online  binary  search  tree,  and  planning  paths  with  an 
adversary)  the  predictions  have  some  internal  structure.  For  example,  in  a  game  of  poker 
(see  Section  10  below),  the  prediction  must  be  a  valid  poker  strategy  which  specifies  how 
to  play  during  the  next  hand. 

So, we consider prediction problems where $\mathcal{Y}$ is a larger set with internal structure, and
derive new learning rules, the Lagrangian Hedging algorithms, which take advantage of
this structure to provide tighter regret bounds and run faster. The LH algorithms are a
direct generalization of known no-regret learning rules like weighted majority and
external-regret matching, and they reduce to these rules when choosing from a small discrete set
of predictions.

¹ Many problems use loss functions of the form $\ell_t(y_t) = \ell(y_t, y_t^{\mathrm{true}})$, where $\ell$ is a fixed function such as
squared error and $y_t^{\mathrm{true}}$ is a target output. The more general notation allows for problems where there
may be more than one correct prediction.


2  Structured  prediction  problems 

2.1  Problem  definition 

Our algorithm chooses its prediction at each round from a hypothesis set $\mathcal{Y}$. We assume
that $\mathcal{Y}$ is a compact subset of $\mathbb{R}^d$ that has at least two elements.

In classical no-regret algorithms such as weighted majority, $\mathcal{Y}$ is a simplex. The corners
of $\mathcal{Y}$ represent pure actions, the interior points of $\mathcal{Y}$ represent probability distributions
over pure actions, and the number of corners $n$ is the same as the number of dimensions $d$.
In a structured prediction problem, on the other hand, $\mathcal{Y}$ may have many more extreme
points than dimensions, $n \gg d$. For example, $\mathcal{Y}$ could be a convex set like

$$\{y \mid Ay = b,\ y \ge 0\}$$

for some matrix $A$ and vector $b$ (in which case the number of extreme points can be
exponential in $d$), or it could be a sphere (which has infinitely many extreme points), or
it could be a set of discrete points like the corners of a hypercube.

The shape of $\mathcal{Y}$ captures the structure in our structured prediction problem. Each
point in $\mathcal{Y}$ is a separate hypothesis, but the losses of different hypotheses are related to
each other because they are all embedded in the common representation space $\mathbb{R}^d$. This
relationship gives us the ability to infer the loss of one hypothesis from the losses of others.
For  example,  consider  two  Texas  Hold’Em  strategies  which  differ  only  in  how  aggressively 
they  bet  after  seeing  a  particular  sequence  of  play  like  “Q3  down,  no  bets,  557  flopped” : 
these  strategies  will  have  very  similar  expected  payoffs  against  any  opponent,  despite  being 
distinct  hypotheses. 

It is important to take advantage of available structure in $\mathcal{Y}$. To see why, imagine
running a standard no-regret algorithm such as weighted majority on a structured $\mathcal{Y}$: to
do so, we must give it hypotheses corresponding to the extreme points $c_1, \ldots, c_n$ of $\mathcal{Y}$. Our
running time and regret bounds will then depend on the number of extreme points $n$. If
$n$ is exponential in $d$ (as for sets of the form $\{Ay = b,\ y \ge 0\}$), we will have difficulty
keeping track of our past loss functions in a reasonable amount of time and space, and our
regret bounds may be larger than necessary. If $n$ is infinite (as for spheres), the situation
will be even worse: it will be impossible to remember our loss functions at all without
some kind of trick, and our regret bounds will be vacuous.

2.2  Reductions  which  simplify  notation

If $\mathcal{Y}$ is convex, there will never be any need for our algorithm to randomize: for any convex
loss function $\ell$, we have $\ell(E(y)) \le E(\ell(y))$ by Jensen’s inequality, so we can replace any
distribution over $\mathcal{Y}$ by its expectation without hurting our performance. On the other
hand, if $\mathcal{Y}$ is not convex our algorithm may need to randomize to achieve low regret: for
example, if $\mathcal{Y} = \{0, 1\}$, it is impossible for a deterministic algorithm to guarantee less than
$\Theta(t)$ regret in $t$ trials.
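The $\{0, 1\}$ example can be made concrete with a short simulation. The sketch below is an illustration under stated assumptions, not a construction taken from the analysis: the adversary simulates the deterministic predictor and charges loss 1 to whichever hypothesis it is about to pick and loss 0 to the other. The predictor `follow_the_leader` is a hypothetical rule chosen only as an example.

```python
from typing import Callable, List


def adversarial_regret(predict: Callable[[List[int]], int], trials: int) -> int:
    """Regret of a deterministic predictor on Y = {0, 1} when the adversary
    always charges loss 1 to the hypothesis the predictor picks."""
    history: List[int] = []
    for _ in range(trials):
        # The predictor is deterministic, so the adversary can anticipate it.
        history.append(predict(history))
    our_loss = trials                  # we pay 1 on every round
    loss_of_0 = history.count(0)       # fixed y = 0 pays only when we played 0
    loss_of_1 = history.count(1)       # fixed y = 1 pays only when we played 1
    return our_loss - min(loss_of_0, loss_of_1)


def follow_the_leader(history: List[int]) -> int:
    # Hypothetical deterministic rule: play the hypothesis with less past loss.
    return 0 if history.count(0) <= history.count(1) else 1
```

Since the less-played hypothesis is charged on at most half of the rounds, any deterministic rule ends up with regret of at least `trials / 2` against this adversary.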

To build a randomized algorithm we will allow ourselves to pick hypotheses from the
convex hull of $\mathcal{Y}$. We will interpret a point in $\operatorname{conv} \mathcal{Y}$ as a probability distribution over
the elements of $\mathcal{Y}$ by decomposing $y = \sum_i p_i y_i$, where $y_i \in \mathcal{Y}$, $p_i \ge 0$, and $\sum_i p_i = 1$.
(In fact, there will usually be several such representations of a given $y$; different ones may
yield different regrets, but they will all satisfy our regret bounds below.) For convenience
of notation we will take $\mathcal{Y}$ to be a convex set in the remainder of this paper, with the
understanding that some elements of $\mathcal{Y}$ may be interpreted as randomized actions.

Our algorithms below are stated in terms of linear loss functions, $\ell_t(y) = c_t \cdot y$. If
$\ell_t$ is nonlinear but convex, we have two options: first, we can substitute the derivative
at the current prediction, $\partial \ell_t(y_t)$, for $c_t$, and our regret bounds will still hold [1, p. 54].
Or, second, we can apply the standard convex programming trick of adding constraints to
make our objective linear: for example, if our losses are KL-divergences

$$\ell_t(y) = y \ln \frac{y}{p_t} + (1 - y) \ln \frac{1 - y}{1 - p_t}$$

we can add a new variable $z$ and a new constraint

$$z \ge y \ln y + (1 - y) \ln(1 - y)$$

resulting in a new feasible region $\mathcal{Y}'$.² We can then write an equivalent loss function which
is linear over $\mathcal{Y}'$:

$$\ell_t(y, z) = z - y \ln p_t - (1 - y) \ln(1 - p_t) \qquad (y, z) \in \mathcal{Y}'$$

In either case we will assume in the remainder of this paper that the loss functions are
linear, and we will write $\mathcal{C}$ for the set of possible gradient vectors $c_t$.
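The reason the first option (substituting a derivative for the cost vector) preserves regret bounds can be sketched in a few lines. The derivation below is the standard convexity argument, written with $c_i$ denoting any subgradient of $\ell_i$ at the prediction $y_i$; it is included as an illustration rather than as a restatement of the proof cited above.

```latex
\begin{align*}
\ell_i(y) &\ge \ell_i(y_i) + c_i \cdot (y - y_i),
  \qquad c_i \in \partial \ell_i(y_i)
  && \text{(convexity of } \ell_i\text{)} \\
\ell_i(y_i) - \ell_i(y) &\le c_i \cdot y_i - c_i \cdot y \\
\sum_{i=1}^{t-1} \bigl( \ell_i(y_i) - \ell_i(y) \bigr)
  &\le \sum_{i=1}^{t-1} \bigl( c_i \cdot y_i - c_i \cdot y \bigr)
\end{align*}
```

The left-hand side of the last line is the regret under the true convex losses and the right-hand side is the regret under the linearized losses, so any upper bound on the latter also bounds the former.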

3  Related  work 

A large number of researchers have studied online prediction in general and online convex
programming in particular. From the online prediction literature, the closest related work
is that of Cesa-Bianchi and Lugosi [2], which follows in the tradition of an algorithm and
proof by Blackwell [3]. Cesa-Bianchi and Lugosi consider choosing predictions from an
essentially-arbitrary decision space and receiving outcomes from an essentially-arbitrary
outcome space. Together a decision and an outcome determine how a marker $R_t \in \mathbb{R}^d$
will move. Given a potential function $G$, they present algorithms which keep $G(R_t)$ from
growing too quickly. This result is similar in flavor to our Theorem 5, and both Theorem 5
and the results of Cesa-Bianchi and Lugosi are based on Blackwell-like conditions.

² Technically, we must also add a vacuous upper bound on $z$ to maintain our assumption of a bounded
feasible region.

The main differences between the Cesa-Bianchi-Lugosi results and ours are the
restrictions that they place on their potential functions. They write their potential
function as $G(u) = f(\Phi(u))$; they require $\Phi$ to be additive (that is, $\Phi(u) = \sum_i \phi_i(u_i)$ for
one-dimensional functions $\phi_i$), nonnegative, and twice differentiable, and they require
$f : \mathbb{R}_+ \mapsto \mathbb{R}_+$ to be increasing, concave, and twice differentiable. These restrictions rule
out many of the potential functions used here. The most restrictive requirement is that
$\Phi$ be additive; for example, unless the set $\mathcal{Y}$ can be factored as $\mathcal{Y}_1 \times \mathcal{Y}_2 \times \ldots \times \mathcal{Y}_N$ for
one-dimensional sets $\mathcal{Y}_1, \mathcal{Y}_2, \ldots, \mathcal{Y}_N$, potential functions defined via Equation (7) are
generally not expressible as $f(\Phi(u))$ for additive $\Phi$. The differentiability requirement rules
out potential functions like $[x]_+^2$, which is not twice differentiable at $x = 0$.³

Our  more  general  potential  functions  are  what  allow  us  to  define  no-regret  algorithms 
that  work  on  structured  hypothesis  spaces  like  the  set  of  paths  through  a  graph  or  the  set 
of  sequence  weights  in  an  extensive-form  game.  Ours  is  the  first  result  which  allows  one 
to  construct  such  potential  functions  easily:  combining  any  of  a  number  of  well-studied 
hedging functions (such as negentropy, componentwise negentropy, or squared $L_p$ norms)
with an arbitrary compact convex hypothesis set, as described in Section 6, results in a
no-regret algorithm. Previous results such as Cesa-Bianchi and Lugosi’s provide no guidance
in  constructing  potentials  for  such  hypothesis  sets. 

In  the  online  convex  programming  literature,  perhaps  the  best  known  recent  related 
papers are those of Kalai and Vempala [4] and Zinkevich [5]. The online convex programming
problem has a much longer history, though: the first description of the problem and
the first algorithm of which we are aware were presented by Hannan in 1957 [6], although
Hannan didn’t use the name “online convex programming.” And, the current author’s
Generalized Gradient Descent algorithm [1, 7] solves a generalization of the online convex
programming problem, although it was not originally presented in those terms: if each of
GGD’s loss functions $\ell_t(y)$ for $t \ge 1$ is of the form $c_t \cdot y + I(y)$, where $I$ is 0 inside the
feasible set and $\infty$ outside, then GGD solves online convex programs. If in addition GGD’s
prior loss $\ell_0(y)$ is proportional to $\|y\|_2^2$, then GGD acts like Zinkevich’s lazy projection
algorithm with a fixed learning rate [8].

³ Cesa-Bianchi and Lugosi claim (p. 243) that their results apply to $\phi(x) = [x]_+^p$ with $p \ge 2$, but this
appears to be a slight error; the Taylor expansion step in the proof on p. 242 requires twice-differentiability
and therefore needs $p > 2$. My thanks to Amy Greenwald for pointing this fact out to me.

Compared to the above online convex programming papers, the most important
contributions of the current paper are the flexibility of its algorithm and the simplicity and
generality of its proof. Ours is the first algorithm based on general potential functions
which can solve arbitrary online convex programs.⁴ And, our proof contains as special
cases most of the common no-regret bounds, including for example those for Hedge and
weighted majority: while our overall algorithm is new, by choosing the appropriate
potential functions one can reduce it to various well-known algorithms, and our bounds reduce
to the corresponding specific bounds.

The  flexibility  of  our  algorithm  comes  from  our  freedom  to  choose  from  a  wide  range 
of  potential  functions;  because  of  this  freedom  we  can  design  algorithms  which  force  their 
average regret to zero in a variety of ways. For example, if we define the safe set $\mathcal{S}$ as
in Section 4, we can try to decrease two-norm, max-norm, or one-norm distance from
$\mathcal{S}$ as rapidly as possible by choosing hedging functions based on $\|y\|_2^2$, negentropy, or
componentwise negentropy respectively. The simplicity of the proof results from our use
of  Blackwell-style  approachability  arguments;  our  core  result,  Theorem  5,  takes  only  half 
a  dozen  short  equations  to  prove.  This  theorem  is  the  first  generalization  of  well-known 
online  learning  results  such  as  Cesa-Bianchi  and  Lugosi’s  to  online  convex  programming, 
and  it  is  the  most  general  result  of  this  sort  that  we  know. 

More minor contributions include: our bounds are better than those of previous
algorithms such as that of Kalai and Vempala, since (unless $p = 1$ in Theorem 3) we do not
need to adjust a learning rate based on prior knowledge of the number of trials. And, we
are not aware of any prior application of online learning to playing extensive-form games.

In  addition  to  the  general  papers  above,  a  number  of  no-regret  algorithms  for  specific 
online  convex  programs  have  appeared  in  the  literature.  These  include  predicting  nearly 
as  well  as  the  best  pruning  of  a  decision  tree  [9],  reorganizing  a  binary  search  tree  online 
so  that  frequently-accessed  items  are  close  to  the  root  [4],  and  picking  paths  in  a  graph 
with  unknown  edge  costs  [10]. 

4  Regret  vectors  and  safe  sets 

⁴ The current author’s GGD and MAP algorithms [1, 7] can both handle a general class of convex
potential functions and feasible regions, but they depend either on an adjustable learning rate or on the
degree of convexity of the loss functions $\ell_t$ to achieve sublinear regret.

Figure 1: A set $\mathcal{Y} = \{y_1 + y_2 = 1,\ y \ge 0\}$ (thick dark line) and its safe set $\mathcal{S}$ (light
shaded region). Note $y \cdot s \le 0$ for all $y \in \bar{\mathcal{Y}}$ and $s \in \mathcal{S}$, where $\bar{\mathcal{Y}}$ is the positive orthant.

    $s_1 \leftarrow 0$
    for $t \leftarrow 1, 2, \ldots$
        $\hat{y}_t \leftarrow f(s_t)$                                  (*)
        if $\hat{y}_t \cdot u > 0$ then
            $y_t \leftarrow \hat{y}_t / (\hat{y}_t \cdot u)$
        else
            $y_t \leftarrow$ arbitrary element of $\mathcal{Y}$
        fi
        Observe $c_t$, compute $s_{t+1}$ from (1)
    end

Figure 2: The gradient form of the Lagrangian Hedging algorithm.

Lagrangian Hedging algorithms maintain their state in a regret vector. This vector contains
information about our actual losses and the gradients of our loss functions. Given a loss
function $\ell_t(y) = c_t \cdot y$ as described in Section 2, we can define the regret vector $s_t$ by the
recursion:

$$s_{t+1} = s_t + (y_t \cdot c_t) u - c_t \qquad (1)$$

with the base case $s_1 = 0$. Here $u$ is an arbitrary vector which satisfies $y \cdot u = 1$ for all
$y \in \mathcal{Y}$. If necessary we can append a constant element to each $y$ so that such a $u$ exists.
Given $s_t$ we can compute our regret versus any hypothesis $y$:


$$y \cdot s_t = \sum_{i=1}^{t-1} (y_i \cdot c_i)\, y \cdot u - \sum_{i=1}^{t-1} y \cdot c_i = L_t - \sum_{i=1}^{t-1} y \cdot c_i = \rho_t(y)$$

This  property  justifies  the  name  “regret  vector.” 
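The identity $y \cdot s_t = \rho_t(y)$ is easy to check numerically. The sketch below accumulates the regret vector with the recursion (1) on randomly generated plays and cost vectors (the data, the seed, and the helper names are assumptions made for illustration) and compares $y \cdot s_t$ against the regret computed directly from the losses.

```python
import random
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def regret_vector(plays: List[List[float]], costs: List[List[float]],
                  u: List[float]) -> List[float]:
    """Accumulate s via s_{t+1} = s_t + (y_t . c_t) u - c_t, starting at 0."""
    s = [0.0] * len(u)
    for y, c in zip(plays, costs):
        loss = dot(y, c)
        s = [si + loss * ui - ci for si, ui, ci in zip(s, u, c)]
    return s


def random_simplex_point(rng: random.Random, d: int) -> List[float]:
    w = [rng.random() + 1e-9 for _ in range(d)]
    total = sum(w)
    return [x / total for x in w]


rng = random.Random(0)
d, rounds = 3, 50
u = [1.0] * d                       # y . u = 1 for every y in the simplex
plays = [random_simplex_point(rng, d) for _ in range(rounds)]
costs = [[rng.random() for _ in range(d)] for _ in range(rounds)]

s = regret_vector(plays, costs, u)
y = [0.2, 0.3, 0.5]                 # a fixed comparison hypothesis
direct_regret = (sum(dot(p, c) for p, c in zip(plays, costs))
                 - sum(dot(y, c) for c in costs))
gap = abs(dot(y, s) - direct_regret)   # zero up to rounding error
```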

We  can  define  a  safe  set,  in  which  our  regret  is  guaranteed  to  be  nonpositive: 

$$\mathcal{S} = \{s \mid (\forall y \in \mathcal{Y})\ y \cdot s \le 0\} \qquad (2)$$

The goal of the Lagrangian Hedging algorithm will be to keep its regret vector $s_t$ near the
safe set $\mathcal{S}$.

Figure 1 shows an example of the safe set for a very simple hypothesis space in $\mathbb{R}^2$. As
is true in general, this example demonstrates that $\mathcal{S}$ is a convex cone: it is closed under
positive linear combinations of its elements. If we define another convex cone

$$\bar{\mathcal{Y}} = \{\lambda y \mid y \in \mathcal{Y},\ \lambda \ge 0\} \qquad (3)$$

then $y \cdot s \le 0$ for all $s \in \mathcal{S}$ and $y \in \bar{\mathcal{Y}}$. In fact, $\bar{\mathcal{Y}}$ is exactly the set of vectors with nonpositive
dot products with all of $\mathcal{S}$, and vice versa: the two cones are polar to each other, written
$\mathcal{S} = \bar{\mathcal{Y}}^\perp$ or $\mathcal{S}^\perp = \bar{\mathcal{Y}}$. See Appendix E for more detail on the properties of polar cones.

5  The  Lagrangian  Hedging  algorithm


We will present the general Lagrangian Hedging algorithm first, then show how to implement
it efficiently for specific problems. The general form of the LH algorithm is also
called  the  gradient  form,  as  distinguished  from  the  optimization  form,  which  is  slightly 
less  general  but  often  easier  to  implement.  The  optimization  form  is  presented  below  in 
Section  6.  The  name  “Lagrangian  Hedging”  comes  from  the  fact  that  the  LH  algorithm 
is  a  generalization  of  Freund  and  Schapire’s  Hedge  algorithm  [11],  and  that  its  hypothesis 
can  be  thought  of  as  a  Lagrange  multiplier  for  a  constraint  which  keeps  its  regret  from 
growing  too  fast. 

At each time step, the LH algorithm chooses its play based on the current regret vector
$s_t$, as defined in Equation (1). The LH algorithm depends on one free parameter, a closed
convex potential function $F(s)$ which is defined everywhere in $\mathbb{R}^d$. The potential function
should be small when $s$ is in the safe set, and large when $s$ is far from the safe set.

For example, suppose that $\mathcal{Y}$ is the probability simplex in $\mathbb{R}^d$, so that $\mathcal{S}$ is the negative
orthant in $\mathbb{R}^d$. (This choice of $\mathcal{Y}$ would be appropriate for playing a matrix game or
predicting from expert advice.) For this $\mathcal{Y}$, two possible potential functions are

$$F_1(s) = \ln \sum_i e^{\eta s_i} - \ln d$$

where $\eta$ is a positive learning rate, and

$$F_2(s) = \sum_i [s_i]_+^2 / 2$$

where $[s]_+$ is the positive part of $s$. The potential $F_1$ will lead to the Hedge [11] and
weighted majority [12] algorithms, while the potential $F_2$ will result in an algorithm called
external-regret matching [13, Theorem B]. For a more complicated example of a useful
potential function, see Section 9 below.
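To see how these two potentials turn into familiar update rules, the sketch below computes the gradient of each potential and rescales it to sum to one, which is what normalizing by $u = [1, \ldots, 1]^T$ amounts to on the simplex. The function names are hypothetical, and the uniform fallback used when no regret is positive is one arbitrary choice among many.

```python
import math
from typing import List, Sequence


def hedge_play(s: Sequence[float], eta: float) -> List[float]:
    """Normalized gradient of F1(s) = ln sum_i exp(eta * s_i) - ln d.

    The gradient is eta * softmax(eta * s); dividing by its sum leaves the
    exponential weights used by Hedge and weighted majority.
    """
    m = max(s)                                     # shift for numerical stability
    w = [math.exp(eta * (v - m)) for v in s]
    z = sum(w)
    return [x / z for x in w]


def regret_matching_play(s: Sequence[float]) -> List[float]:
    """Normalized gradient of F2(s) = sum_i [s_i]_+^2 / 2.

    The gradient is [s]_+; if it is zero we fall back to the uniform
    distribution (an arbitrary valid hypothesis).
    """
    pos = [max(v, 0.0) for v in s]
    z = sum(pos)
    if z > 0:
        return [x / z for x in pos]
    return [1.0 / len(s)] * len(s)


s = [1.0, -2.0, 3.0]                               # example regret vector (assumed)
p_hedge = hedge_play(s, eta=1.0)
p_match = regret_matching_play(s)
```

Both rules put more weight on actions with larger accumulated regret; regret matching ignores actions whose regret is negative, while Hedge keeps a small positive weight on every action.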

In  order  for  the  LH  algorithm  to  be  well-defined  we  require 

$$F(s) \le 0 \quad \forall s \in \mathcal{S} \qquad (4)$$

We will impose additional requirements on $F$ later for our regret bounds. We will write
$f(s)$ for an arbitrary subgradient of $F$; that is, $f(s) \in \partial F(s)$ for all $s$. (For an introduction
to subgradients and convex analysis, see Appendix E. Such an $f$ is guaranteed to exist
since $F$ is finite everywhere.)

The LH algorithm is shown in Figure 2. On each step, it computes $\hat{y}_t = f(s_t)$, then
renormalizes to get $y_t$. To show that the LH algorithm is well-defined, we need to prove
that $y_t$ is always a valid hypothesis; Theorem 1, whose proof is given in Appendix C, does
so. (Recall that, as described in Section 2, we can replace a non-convex $\mathcal{Y}$ by $\operatorname{conv} \mathcal{Y}$ and
interpret the elements of $\operatorname{conv} \mathcal{Y}$ as probability distributions over the original $\mathcal{Y}$.)
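The loop of Figure 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the potential enters only through a user-supplied subgradient function `f`, the "arbitrary element of $\mathcal{Y}$" is passed in as `fallback`, and all names are hypothetical. The example call uses the gradient of $F_2$ on the two-dimensional simplex.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def positive_part(s: Sequence[float]) -> Vector:
    """Gradient of F2(s) = sum_i [s_i]_+^2 / 2."""
    return [max(v, 0.0) for v in s]


def lagrangian_hedging(
    f: Callable[[Vector], Vector],   # subgradient of the potential F
    u: Vector,                       # satisfies y . u = 1 for all y in Y
    fallback: Vector,                # the "arbitrary element of Y"
    costs: Sequence[Vector],         # observed cost vectors c_1, c_2, ...
) -> Tuple[List[Vector], Vector]:
    s = [0.0] * len(u)               # s_1 = 0
    plays: List[Vector] = []
    for c in costs:
        y_hat = f(s)                 # line (*): unnormalized hypothesis
        scale = dot(y_hat, u)
        if scale > 0:
            y = [v / scale for v in y_hat]
        else:
            y = list(fallback)
        plays.append(y)
        loss = dot(y, c)
        # Equation (1): s_{t+1} = s_t + (y_t . c_t) u - c_t
        s = [si + loss * ui - ci for si, ui, ci in zip(s, u, c)]
    return plays, s


# Example (assumed data): action 1 always costs 1, action 2 always costs 0.
plays, s_final = lagrangian_hedging(
    positive_part, [1.0, 1.0], [0.5, 0.5], [[1.0, 0.0]] * 3
)
```

In this run the first play is the fallback because the regret vector starts at zero; from the second round on, all weight moves to the cheaper action.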
Theorem 1 The LH algorithm is well-defined: given a closed convex hypothesis set $\mathcal{Y}$
and a vector $u$ with $u \cdot y = 1$ for all $y \in \mathcal{Y}$, define $\mathcal{S}$ as in (2) and fix a convex potential
function $F$ which is everywhere finite. If $F(s) \le 0$ for all $s \in \mathcal{S}$, then the LH algorithm
with potential $F$ picks hypotheses $y_t \in \mathcal{Y}$ for all $t$.

We  can  also  define  a  version  of  the  LH  algorithm  with  an  adjustable  learning  rate:  if 
we use the potential function $F(\eta s)$ instead of $F(s)$, the result is equivalent to updating
$s_t$ with a learning rate $\eta$. Below, the ability to adjust our learning rate will help us obtain
regret  bounds  for  some  classes  of  potential  functions. 


6  The  optimization  form 


Even if we have a convenient representation of our hypothesis space $\mathcal{Y}$, it may not be easy
to work directly with the safe set $\mathcal{S}$. In particular, it may be difficult to define, evaluate,
and differentiate a potential function $F$ which has the necessary properties.

For example, a typical choice for $F$ is “squared Euclidean distance from $\mathcal{S}$.” If $\mathcal{S}$ is
the negative orthant (as it would be for standard experts algorithms), then $F$ is easy to
work with: we can separate $F$ into a sum of $d$ simple terms, one for each dimension. On
the other hand, if $\mathcal{S}$ is the safe set for a complicated hypothesis space (such as
$\mathcal{Y} = \{y \ge 0 \mid Ay + b = 0\}$ for some matrix $A$ and vector $b$), it is not obvious how to compute $\mathcal{S}$,
$F(s)$, or $\partial F(s)$ efficiently: $F$ can have many quadratic pieces with boundaries at many
different orientations, and there is generally no way to break $F$ into the sum of a small
number of simple terms. For the same reason, it may also be difficult to prove that $F$ has
the curvature properties required for the performance analysis of Theorem 3.

To avoid these difficulties, we can work with an alternate form of the Lagrangian
Hedging algorithm. This form, called the optimization form, defines $F$ in terms of a
simpler function $W$ which we will call the hedging function. On each step, it computes $F$
and $\partial F$ by solving an optimization problem involving $W$ and the hypothesis set $\mathcal{Y}$. In our
example above, where $F$ is squared Euclidean distance from $\mathcal{S}$, the optimization problem
is minimum-Euclidean-distance projection: we split $s$ into two orthogonal components,
one in $\bar{\mathcal{Y}}$ and one in $\mathcal{S}$. This optimization is easy since we have a compact representation
of $\mathcal{Y}$. And, knowing the component of $s$ in $\mathcal{S}$ tells us which quadratic piece of $F$ is active,
making it easy to compute $F(s)$ and an element of $\partial F(s)$.

For  example,  two  possible  hedging  functions  are 


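$$W_1(y) = \begin{cases} \sum_i y_i \ln y_i + \ln d & \text{if } y \ge 0,\ \sum_i y_i = 1 \\ \infty & \text{otherwise} \end{cases} \qquad (5)$$

and

$$W_2(y) = \sum_i y_i^2 / 2 \qquad (6)$$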


If $\mathcal{Y}$ is the probability simplex in $\mathbb{R}^d$ (so that $\mathcal{S}$ is the negative orthant in $\mathbb{R}^d$ and we can
choose $u = [1, 1, \ldots, 1]^T$), then $W_1(y/\eta)$ and $W_2(y)$ correspond to the potential functions
$F_1$ and $F_2$ from Section 5 above. So, these hedging functions result in the weighted
majority and external-regret matching algorithms respectively. In these examples, since
$F_1$ and $F_2$ are already simple, $W_1$ and $W_2$ are not any simpler. For an example where
the hedging function is easy to write analytically but the potential function is much more
complicated, see Section 9 below.
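The correspondence between $W_1(y/\eta)$ and $F_1$ can be checked numerically. The sketch below assumes the potential is obtained from the hedging function by maximizing $s \cdot y - W_1(y/\eta)$ over the nonnegative multiples of the simplex. Writing $y = \eta p$ with $p$ on the simplex, the objective becomes $\eta\, s \cdot p - W_1(p)$, and the maximizer is the softmax of $\eta s$. All function names and the test values are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def F1(s: Sequence[float], eta: float) -> float:
    """F1(s) = ln sum_i exp(eta * s_i) - ln d, computed stably."""
    m = max(s)
    return (eta * m + math.log(sum(math.exp(eta * (v - m)) for v in s))
            - math.log(len(s)))


def W1(p: Sequence[float]) -> float:
    """Negentropy plus ln d for p on the simplex, with 0 ln 0 taken as 0."""
    return sum(x * math.log(x) for x in p if x > 0) + math.log(len(p))


def softmax(s: Sequence[float], eta: float) -> List[float]:
    m = max(s)
    w = [math.exp(eta * (v - m)) for v in s]
    z = sum(w)
    return [x / z for x in w]


def objective(s: Sequence[float], p: Sequence[float], eta: float) -> float:
    # y = eta * p lies in the cone over the simplex, and W1(y / eta) = W1(p).
    return eta * dot(s, p) - W1(p)


s, eta = [0.3, -1.2, 0.8], 0.7
best = objective(s, softmax(s, eta), eta)        # value at the claimed maximizer
uniform = objective(s, [1 / 3, 1 / 3, 1 / 3], eta)
```

At the softmax point the objective matches $F_1(s)$ up to rounding, while other points on the simplex (such as the uniform distribution) give a value that is no larger.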

For the optimization form of the LH algorithm to be well-defined, $W$ should be convex,
$\operatorname{dom} W \cap \bar{\mathcal{Y}}$ should be nonempty, $W(y) \ge 0$ for all $y$, and the sets $\{y \mid W(y) + s \cdot y \le k\}$
should be compact for all $s$ and $k$. (The last condition is equivalent to saying that $W$ is
closed and increases strictly faster than linearly in all directions.) Theorem 2 below shows
that, under these assumptions, the two forms of the LH algorithm are equivalent. We will
impose additional requirements on $W$ later for our regret bounds.

We can now describe the optimization problem which allows us to implement the LH
algorithm using $W$ and $\mathcal{Y}$ instead of the corresponding potential function $F$. Define $\bar{\mathcal{Y}}$ as
in (3). Then $F$ is defined to be⁵

$$F(s) = \sup_{y \in \bar{\mathcal{Y}}} (s \cdot y - W(y)) \qquad (7)$$

We can compute $F(s)$ by solving (7), but for the LH algorithm we need $\partial F$ instead. As
Theorem 2 below shows, there is always a $\hat{y}$ which achieves the maximum in (7):

$$\hat{y} \in \arg\max_{y \in \bar{\mathcal{Y}}} (s \cdot y - W(y)) \qquad (8)$$

and any such $\hat{y}$ is an element of $\partial F(s)$; so, we can use Equation (8) with $s = s_t$ to compute
$\hat{y}_t$ in line (*) of the LH algorithm (Figure 2).

To gain an intuition for Equations (7-8), let us look at the example of the external-regret
matching algorithm in more detail. Since $\mathcal{Y}$ is the unit simplex in $\mathbb{R}^d$, $\bar{\mathcal{Y}}$ is the
positive orthant $\mathbb{R}^d_+$. So, with $W_2(y) = \|y\|_2^2 / 2$, the optimization problem (8) will be
equivalent to

$$\hat{y} = \arg\min_{y \in \mathbb{R}^d_+} \frac{1}{2} \|s - y\|_2^2$$

That is, $\hat{y}$ is the projection of $s$ onto $\mathbb{R}^d_+$ by minimum Euclidean distance. It is not hard
to verify that this projection replaces the negative elements of $s$ with zeros, $\hat{y} = [s]_+$.

⁵ This definition is similar to the definition of the convex dual $W^*$ (see Appendix E), but the supremum
is over $y \in \bar{\mathcal{Y}}$ instead of over all $y$. As a result, $F$ and $W^*$ can be very different functions. As pointed out
in Appendix B, $F$ can be expressed as the dual of a function related to $W$: it is $F = (I_{\bar{\mathcal{Y}}} + W)^*$, where
$I_{\bar{\mathcal{Y}}}$ is 0 within $\bar{\mathcal{Y}}$ and $\infty$ outside of $\bar{\mathcal{Y}}$. We state our results in terms of $W$ rather than $F^*$ because $W$ will
usually be a simpler function, and so it will generally be easier to verify properties of $W$.

Substituting this value for $y$ back into (7) and using the fact that $s \cdot [s]_+ = [s]_+ \cdot [s]_+$, the
resulting potential function is

$$F_2(s) = s \cdot [s]_+ - \sum_i [s_i]_+^2 / 2 = \sum_i [s_i]_+^2 / 2$$

as claimed above. This potential function is the standard one for analyzing the
external-regret matching algorithm.
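The external-regret matching example above can be checked with a few lines of Python. The sketch assumes the setting of the example (hedging function $W_2(y) = \|y\|_2^2/2$ and the positive orthant as the feasible cone); the helper names and the test vector are hypothetical. It evaluates the objective $s \cdot y - W_2(y)$ at $y = [s]_+$, compares it with the closed form $\sum_i [s_i]_+^2/2$, and confirms that randomly drawn points of the orthant do no better.

```python
import random
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def objective(s: Sequence[float], y: Sequence[float]) -> float:
    """s . y - W2(y) with W2(y) = ||y||_2^2 / 2."""
    return dot(s, y) - sum(v * v for v in y) / 2


def project_to_orthant(s: Sequence[float]) -> List[float]:
    """Euclidean projection onto the positive orthant: [s]_+."""
    return [max(v, 0.0) for v in s]


def F2(s: Sequence[float]) -> float:
    return sum(max(v, 0.0) ** 2 for v in s) / 2


s = [1.5, -0.5, 2.0]
y_hat = project_to_orthant(s)
value_at_projection = objective(s, y_hat)

# Random feasible points (assumed sampling range) should never beat y_hat.
rng = random.Random(0)
best_random = max(
    objective(s, [rng.uniform(0.0, 4.0) for _ in s]) for _ in range(1000)
)
```

Because the objective separates across coordinates, each coordinate can be maximized on its own, which is why clipping negative entries to zero gives the maximizer.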

Theorem 2 Let $W$ be convex, $\operatorname{dom} W \cap \bar{\mathcal{Y}}$ be nonempty, and $W(y) \ge 0$ for all $y$. Suppose
the sets $\{y \mid W(y) + s \cdot y \le k\}$ are compact for all $s$ and $k$. Define $F$ as in (7). Then
$F$ is finite and $F(s) \le 0$ for all $s \in \mathcal{S}$. And, the optimization form of the LH algorithm
using the hedging function $W$ is equivalent to the gradient form of the LH algorithm with
potential function $F$.

The  proof  of  Theorem  2  is  given  in  Appendix  C. 


7  Theoretical  results 

Our main theoretical results are regret bounds for the LH algorithm. The bounds depend
on the curvature of our potential function $F$, the size of the hypothesis set $\mathcal{Y}$, and the
possible slopes $\mathcal{C}$ of our loss functions. Intuitively, $F$ must be neither too curved nor too
flat on the scale of the updates to $s_t$ from Equation (1): if $F$ is too curved then $\partial F$ will
change too quickly and our hypothesis $y_t$ will jump around a lot, while if $F$ is too flat
then we will not react quickly enough to changes in regret.

7.1  Gradient  form 

We  will  need  upper  and  lower  bounds  on  F.  We  will  assume 

$$F(s + \Delta) \le F(s) + \Delta \cdot f(s) + C \|\Delta\|^2 \qquad (9)$$

for all regret vectors $s$ and increments $\Delta$, and

$$[F(s) + A]_+ \ge \inf_{s' \in \mathcal{S}} B \|s - s'\|^p \qquad (10)$$

for all $s$. Here $\|\cdot\|$ is an arbitrary finite norm, and $A \ge 0$, $B > 0$, $C > 0$, and $1 \le p \le 2$ are
constants.⁶ Equation (9), together with the convexity of $F$, implies that $F$ is differentiable
and that $f$ is its gradient; the LH algorithm is still applicable if $F$ is not differentiable,
but its regret bounds are weaker.

⁶ The number $p$ has nothing to do with the chosen norm; for example, we could choose $p = 1.5$ but use
Euclidean distance (the 2-norm) or even a non-$L_p$ norm.

We will bound the size of $\mathcal{Y}$ by assuming that

$$\|y\|_\circ \le M \qquad (11)$$

for all $y$ in $\mathcal{Y}$. Here, $\|\cdot\|_\circ$ is the dual of the norm used in Equation (9). (See Appendix E
for  more  information  about  dual  norms.) 

The size of our update to $s_t$ (in Equation (1)) depends on the hypothesis set $\mathcal{Y}$, the
cost vector set $\mathcal{C}$, and the vector $u$. We have already bounded $\mathcal{Y}$; rather than bounding $\mathcal{C}$
and $u$ separately, we will assume that there is a constant $D$ so that

$$E(\|s_{t+1} - s_t\|^2 \mid s_t) \le D \qquad (12)$$

Here the expectation is taken with respect to our choice of hypothesis, so the inequality
must hold for all possible values of $c_t$. (The expectation operator is only necessary if
we randomize our choice of hypothesis, as would happen if $\mathcal{Y}$ is the convex hull of some
non-convex set. If $\mathcal{Y}$ was convex to begin with, we need not randomize, so we can drop
the expectation in (12) and below.)

Our  theorem  then  bounds  our  regret  in  terms  of  the  above  constants;  see  Appendix  A 
for  a  proof.  Since  the  bounds  are  sublinear  in  t,  they  show  that  Lagrangian  Hedging  is  a 
no-regret  algorithm  when  we  choose  an  appropriate  potential  F. 

Theorem 3 Suppose the potential function F is convex and satisfies Equations (4), (9)
and (10). Suppose that the problem definition is bounded according to (11) and (12). Then
the LH algorithm (Figure 2) achieves expected regret

    E(\rho_{t+1}(y)) \le M((tCD + A)/B)^{1/p} = O(t^{1/p})

versus any hypothesis y \in \mathcal{Y}.

If p = 1 the above bound is O(t). But, suppose that we know ahead of time the number
of trials t we will see. Define G(s) = F(\eta s), where

    \eta = \sqrt{A/(tCD)}

Then the LH algorithm with potential G achieves regret

    E(\rho_{t+1}(y)) \le (2M/B)\sqrt{tACD} = O(\sqrt{t})

for any hypothesis y \in \mathcal{Y}.
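The two bounds of Theorem 3 are easy to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the constants in the demo at the bottom are illustrative values chosen only so that p = 1.

```python
import math


def lh_regret_bound(t, M, A, B, C, D, p):
    """Theorem 3 bound: M * ((t*C*D + A) / B) ** (1/p)."""
    if not 1.0 <= p <= 2.0:
        raise ValueError("p must lie in [1, 2]")
    return M * ((t * C * D + A) / B) ** (1.0 / p)


def tuned_learning_rate(t, A, C, D):
    """Learning rate eta = sqrt(A / (t*C*D)) for a horizon t known in advance."""
    return math.sqrt(A / (t * C * D))


def tuned_regret_bound(t, M, A, B, C, D):
    """Bound (2M/B) * sqrt(t*A*C*D) for the rescaled potential G(s) = F(eta*s)."""
    return (2.0 * M / B) * math.sqrt(t * A * C * D)


if __name__ == "__main__":
    # Illustrative constants only (hypothetical, with p = 1).
    t, M, A, B, C, D = 10_000, 1.0, math.log(8), 1.0, 0.5, 1.0
    print("untuned bound:", lh_regret_bound(t, M, A, B, C, D, p=1))
    print("eta:", tuned_learning_rate(t, A, C, D))
    print("tuned bound:", tuned_regret_bound(t, M, A, B, C, D))
```

With p = 1 the untuned bound grows linearly in t, while the tuned bound grows like the square root of t, which is the point of rescaling the potential.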
Figure 3: Given two functions F (dashed line) and G (dash-dot), we can define
conv min(F, G) (solid line) to be the pointwise greatest convex function H such that
H(y) \le min(F(y), G(y)) for all y.

7.2  Optimization  form 

In  order  to  apply  Theorem  3  to  the  optimization  form  of  the  LH  algorithm,  we  will  show 
how  to  transfer  bounds  on  the  hedging  function  W  to  the  potential  function  F.  An  upper 
bound  on  W  will  lead  to  a  lower  bound  on  F,  while  a  lower  bound  on  W  will  yield  an 
upper  bound  on  F.  The  ability  to  transfer  bounds  means  that,  in  order  to  analyze  or 
implement  the  optimization  form  of  the  LH  algorithm,  we  never  have  to  evaluate  the 
potential  function  F  or  its  derivative  explicitly.  Since  W  and  related  functions  may  not 
be  differentiable,  we  will  use  the  notation  of  convex  analysis  to  state  our  bounds;  see 
Appendix  E  for  definitions. 

For our upper bound on F, instead of (9) we will assume that for all unnormalized
hypotheses y_0 \in \bar{\mathcal{Y}} \cap dom \partial W, for all s \in \partial W(y_0), and for all y \in \bar{\mathcal{Y}},

    W(y) \ge W(y_0) + (y - y_0) \cdot s + (1/(4C)) (\|y - y_0\|^\circ)^2    (13)

Here C is the same constant as in Equation (9) and \|\cdot\|^\circ is the dual of the norm
from Equation (9). We will also assume that \bar{\mathcal{Y}} \cap rel int dom W is nonempty; since
rel int dom W \subseteq dom \partial W, this last assumption guarantees that (13) isn't vacuous.

For our lower bound on F, instead of (10) we will assume

    conv min(W(y) - A + I_{\bar{\mathcal{Y}}}(y), I_0(y)) \le B (\|y/B\|^\circ)^q    \forall y \in \bar{\mathcal{Y}}    (14)

Here A and B are the same constants as in (10), \|\cdot\|^\circ is the dual of the norm from
Equation (10), and I_K(y) is 0 when y is in the set K and \infty otherwise. The operation
conv min(F, G) is illustrated in Figure 3. The constant q is defined by 1/p + 1/q = 1 where p
is the constant from (10). Note that, since 1 \le p \le 2, we have 2 \le q \le \infty. As is typical,
we will follow the convention

    |x|^\infty = I_{[-1,1]}(x)

So, when p = 1, Equation (14) is equivalent to

    conv min(W(y) - A + I_{\bar{\mathcal{Y}}}(y), I_0(y)) \le 0    \forall y \in \bar{\mathcal{Y}} with \|y\|^\circ \le B

Our  main  theoretical  result  about  the  optimization  form  of  the  LH  algorithm  is  that  the 
above  bounds  on  W  imply  the  corresponding  bounds  on  F. 

Theorem 4 Suppose that the hedging function W is closed, convex, nonnegative, and
satisfies Equations (13) and (14) with the constants A, B, C, and 2 \le q \le \infty and the
finite norm \|\cdot\|^\circ. Suppose the set \bar{\mathcal{Y}} \cap rel int dom W is nonempty. Define p so that
1/p + 1/q = 1. Define F as in (7). Then the optimization form of the LH algorithm using
hedging function W is equivalent to the gradient form using potential function F, and F
satisfies the assumptions of Theorem 3 with constants A, B, C, and p and the norm \|\cdot\|.

Theorem  4  follows  directly  from  Theorems  2  and  9  (proven  in  Appendices  B  and  C). 
As  an  immediate  corollary  we  have  that  the  optimization  form  satisfies  all  of  the  same 
regret  bounds  as  the  gradient  form;  for  example,  if  the  problem  definition  is  bounded 
by  (11)  and  (12)  with  constants  M  and  D,  Theorem  3  shows  that  our  expected  regret  is 
bounded  by 

    E(\rho_{t+1}(y)) \le M((tCD + A)/B)^{1/p} = O(t^{1/p})

after t steps versus any hypothesis y \in \mathcal{Y}.

One result which is slightly tricky to carry over is the use of learning rates to achieve
no regret when p = 1. The choice of learning rate and the resulting bound are the same
as in Theorem 3, but the implementation is slightly different: to set a learning rate \eta > 0,
we want to use the potential

    G(s) = F(\eta s) = \sup_{y \in \bar{\mathcal{Y}}} (\eta s \cdot y - W(y))

Using the substitution y \mapsto y/\eta, we have

    G(s) = \sup_{y \in \bar{\mathcal{Y}}} (s \cdot y - W(y/\eta))

since y/\eta \in \bar{\mathcal{Y}} whenever y \in \bar{\mathcal{Y}}. So, to achieve a learning rate \eta, we just need to replace
W(y) with W(y/\eta).
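As a quick sanity check of this substitution, consider a hypothetical one-dimensional instance (chosen here for illustration, not taken from the examples in this paper) in which the cone is [0, \infty) and W(y) = y^2/2. Both routes give the same potential:

```latex
\begin{align*}
F(s) &= \sup_{y \ge 0} \left( s y - \tfrac{1}{2} y^2 \right) = \tfrac{1}{2} [s]_+^2, \\
G(s) &= F(\eta s) = \tfrac{1}{2} \eta^2 [s]_+^2, \\
\sup_{y \ge 0} \left( s y - W(y/\eta) \right)
     &= \sup_{y \ge 0} \left( s y - \frac{y^2}{2 \eta^2} \right)
      = \tfrac{1}{2} \eta^2 [s]_+^2 .
\end{align*}
```

The last supremum is attained at y = \eta^2 [s]_+, so rescaling the argument of W reproduces F(\eta s) exactly in this instance.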




8  Examples 

8.1  Matrix  games  and  expert  advice 

The  classical  applications  of  no-regret  algorithms  are  learning  from  expert  advice  and 
learning  to  play  a  repeated  matrix  game.  These  two  tasks  are  essentially  equivalent,  since 
they  both  use  the  probability  simplex 

    \mathcal{Y} = \{ y \mid y \ge 0, \sum_i y_i = 1 \}

for  their  hypothesis  set.  This  choice  of  y  has  no  difficult  structure,  but  we  mention  it  to 
point  out  that  it  is  a  special  case  of  our  general  prediction  problem.  Standard  no-regret 
algorithms  such  as  Freund  and  Schapire’s  Hedge  [11],  Littlestone  and  Warmuth’s  weighted 
majority  [12],  and  Hart  and  Mas-Colell’s  external-regret  matching  [13,  Theorem  B]  are  all 
special  cases  of  the  LH  algorithm. 

For definiteness, we will consider the case of repeated matrix games. On step t we
choose a probability distribution y_t over our possible actions. Our opponent plays a
mixed strategy z_t over his possible actions, and we receive payoff \ell_t(y_t) = z_t^T M y_t = c_t \cdot y_t
where M is our payoff matrix. Our problem is to learn how to play well from experience:
since we do not know our opponent's payoff matrix, we wish to adjust our own play to
achieve high reward against the actual sequence of plays z_1, z_2, ... that we observe.

8.1.1  External-regret  matching 

Perhaps the simplest no-regret algorithm for matrix games is the one we get by taking
W(y) = \|y\|_2^2/2, which leads to F(s) = \|[s]_+\|_2^2/2 as described above. The derivative of
F is

    f(s) = [s]_+

so at each step we take the positive components of our regret vector, renormalize them to
form a probability distribution, and play according to this probability distribution.

Using the Euclidean norm \|\cdot\|_2, it is easy to see that our choice of W satisfies Equation
(13) with C = 1/2 and Equation (14) with A = 0, B = 1/2, and p = q = 2. All
elements of the probability simplex \mathcal{Y} are bounded by \|y\|_2 \le 1, so M = 1 in Equation (11).
And, if our payoff matrix is bounded so that 0 \le M_{ij} \le 1, then c_t \in [0, 1]^d and
y_t \cdot c_t \in [0, 1] in (1), so our regret updates are in [-1, 1]^d. That means that we can take
D = d in Equation (12).

Substituting in the above constants, Theorem 3 tells us that the external-regret matching
algorithm has regret

    E(\rho_{t+1}(y)) \le \sqrt{td}

for any comparison hypothesis y \in \mathcal{Y}.
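External-regret matching on the simplex takes only a few lines to write out. The sketch below is illustrative rather than a reference implementation: the function names are hypothetical, entries of each vector are treated as costs in [0, 1], u is the all-ones vector, and playing the uniform distribution when no regret is positive is one arbitrary choice among many.

```python
import math


def regret_matching_strategy(s):
    """Return [s]_+ renormalized to a distribution (uniform if [s]_+ = 0)."""
    pos = [max(x, 0.0) for x in s]
    total = sum(pos)
    if total == 0.0:
        return [1.0 / len(s)] * len(s)
    return [x / total for x in pos]


def update_regret(s, y, c):
    """Regret update s + (y . c) u - c with u = (1, ..., 1)."""
    incurred = sum(yi * ci for yi, ci in zip(y, c))
    return [si + incurred - ci for si, ci in zip(s, c)]


def run_regret_matching(costs):
    """Play against a fixed sequence of cost vectors; return the final regret vector."""
    s = [0.0] * len(costs[0])
    for c in costs:
        y = regret_matching_strategy(s)
        s = update_regret(s, y, c)
    return s


if __name__ == "__main__":
    d, t = 4, 200
    # Hypothetical deterministic cost sequence with entries in [0, 1].
    costs = [[((k * (i + 1)) % 3) / 2.0 for i in range(d)] for k in range(t)]
    s = run_regret_matching(costs)
    print("largest regret:", max(s), "bound sqrt(t*d):", math.sqrt(t * d))
```

Because the simplex is convex, no randomization is needed here, so the largest component of the final regret vector can be compared directly with the square root of td.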
8.1.2 Hedge


Another  well-known  no-regret  algorithm  for  matrix  games  is  Hedge  [11].  To  reproduce 
this  algorithm,  we  can  use  the  potential  function 

    F(s) = \ln \sum_i e^{s_i} - \ln d

in the gradient form of the LH algorithm. The gradient of F is

    f_i(s) = e^{s_i} / \sum_j e^{s_j}

So, at each step we exponentiate the regrets and then renormalize to get a probability
distribution. This is exactly the Hedge algorithm: the usual formulation of Hedge says to
exponentiate the sum of the loss vectors instead of the regret vector, but since the regret
differs from the sum of the losses by a multiple of u = (1, 1, ..., 1)^T, the difference gets
canceled out in the normalizing constant.
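The cancellation in the normalizing constant can be checked numerically. The sketch below is a minimal illustration: `hedge_distribution` is a hypothetical helper name, the `eta` argument is the learning rate in the sense of Section 7, and the loss totals in the demo are made-up numbers.

```python
import math


def hedge_distribution(s, eta=1.0):
    """f_i(s) = exp(eta*s_i) / sum_j exp(eta*s_j), shifted by max(s) for stability."""
    m = max(s)
    weights = [math.exp(eta * (x - m)) for x in s]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    total_loss = [3.0, 1.5, 2.0]   # hypothetical cumulative loss of each action
    our_loss = 1.75                # hypothetical cumulative loss we incurred
    regret = [our_loss - l for l in total_loss]   # our_loss * u - total_loss
    neg_loss = [-l for l in total_loss]
    print(hedge_distribution(regret))
    print(hedge_distribution(neg_loss))  # same distribution up to rounding
```

Subtracting the maximum before exponentiating is itself an instance of the same invariance: adding any multiple of u to the argument leaves the distribution unchanged.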

For the generalizations of Hedge which we will examine below, it will be helpful to
prove our bounds using the optimization form of the LH algorithm. In the optimization
form, Hedge uses the entropy hedging function shown in Equation (5). This choice of W
is finite only inside \bar{\mathcal{Y}} = \mathbb{R}^d_+, so the optimization (7) just computes W^*(s); it is a standard
result that the F given above is equal to W^*.

Using the L_1 norm \|\cdot\|_1, our choice of W satisfies Equation (13) with C = 1/2 and
Equation (14) with A = \ln d, B = 1, p = 1, and q = \infty. For a proof, see Lemma 10
in Appendix D. All elements of the probability simplex \mathcal{Y} are bounded by \|y\|_1 \le 1, so
M = 1 in Equation (11). Finally, our regret updates are in [-1, 1]^d and so have max norm
no more than 1; so, we can take D = 1 in Equation (12).

Substituting in the above constants, Theorems 3 and 4 tell us that the Hedge algorithm
with learning rate \eta = 1 has

    E(\rho_{t+1}(y)) \le t/2 + \ln d

for any comparison hypothesis y. If we pick instead \eta = \sqrt{(2 \ln d)/t}, the bound becomes

    E(\rho_{t+1}(y)) \le \sqrt{2t \ln d}

This result is similar to well-known bounds on Hedge such as the one obtained by Freund
and Schapire [11, section 2.2]. Translated to our notation, Freund and Schapire chose a
learning rate of

    \eta = \ln(1 + \sqrt{(2 \ln d)/t})

which is slightly slower than our learning rate. They used this learning rate to prove a
regret bound of

    \sqrt{(2 \ln d)/t} + (\ln d)/t

per trial, which is slightly weaker than our bound since it adds a term depending on 1/t.
As t \to \infty, the difference in learning rates approaches zero and the O(1/t) term becomes
irrelevant, so the two bounds become equivalent.7

Figure 4: Synthetic example of a structured prediction problem. Left: domain of x. Right:
domain of y.


8.2  A  simple  synthetic  example 


This  subsection  presents  a  simple  synthetic  example  of  a  structured  prediction  problem 
and  an  LH  algorithm  which  solves  it.  Unlike  the  examples  in  the  previous  subsection, 
there  is  no  obvious  way  to  select  a  potential  function  for  this  problem  without  either 
using  the  techniques  described  in  this  paper  or  moving  to  a  less  efficient  representation 
(such  as  the  one  where  each  corner  of  y  has  a  separate  regret).  In  addition,  this  example 
demonstrates  how  to  apply  the  LH  algorithm  to  regression  or  classification  problems:  in 
these  problems,  each  example  consists  of  an  input  vector  xt  together  with  a  target  output 
zt,  and  our  hypothesis  space  is  a  set  of  functions  y  which  map  inputs  to  outputs. 

In our synthetic problem, the input examples x_t are drawn from the pentagon \mathcal{X} shown
at the left of Figure 4, and the target outputs z_t are either +1 or -1. Our predictions are
linear functions which map \mathcal{X} into the interval [-1, 1]; the set \mathcal{Y} of such functions is the
geometric dual of \mathcal{X}, which is the pentagon shown on the right of Figure 4. We will use
the absolute loss

    \ell_t(y) = \begin{cases} y \cdot x_t & \text{if } z_t = -1 \\ -y \cdot x_t & \text{if } z_t = +1 \end{cases}

or more compactly \ell_t(y) = -z_t x_t \cdot y.

Specifying \mathcal{Y} and \ell_t completely describes our prediction problem. The set \mathcal{Y} is not
particularly complicated, but it does not match the hypothesis sets for any of the standard
no-regret algorithms. So, we will design a Lagrangian Hedging algorithm instead.

7 The extra term in Freund and Schapire's bound appears to be due to the fact that they write the
recommended distribution of actions as \beta^{-S}/Z rather than \exp(\eta S)/Z, requiring an extra linearization
step \ln(1 + \beta) \le \beta in their proofs.

Figure 5: Hypothesis space \mathcal{Y} after including constant component, together with the cone
\bar{\mathcal{Y}} containing \mathcal{Y}.

In order to construct an LH algorithm we need a vector u with u \cdot y = 1 for all y \in \mathcal{Y}.
Since such a vector doesn't exist for the \mathcal{Y} shown in Figure 4, we will add a dummy
dimension to the problem: we will set the third element of y to be 1 for all y \in \mathcal{Y}, and add
a corresponding third element of 0 onto each x so that the predictions remain unchanged.
The modified \mathcal{Y} is shown in Figure 5 as a horizontal pentagon. Figure 5 also shows the
boundaries of a cone extending from the origin through \mathcal{Y}; this cone is \bar{\mathcal{Y}}.

With our modified \mathcal{Y} we can take u = (0, 0, 1)^T. So, the only thing left to specify in our
LH algorithm is our hedging function W. For simplicity we will pick squared Euclidean
norm, \|y\|_2^2/2. Having chosen a hedging function we can now apply the optimization form
of the LH algorithm. The algorithm starts with s_1 = (0, 0, 0)^T, then, for each t, executes
the following steps:

• Project s_t onto \bar{\mathcal{Y}} by minimum Euclidean distance; call the result \bar{y}.

• Normalize \bar{y} to get y_t = \bar{y}/(\bar{y} \cdot u). (If \bar{y} \cdot u = 0 we can choose y_t \in \mathcal{Y} arbitrarily.)

• Predict \hat{z}_t = y_t \cdot x_t and then find out the true z_t.

• Update s_{t+1} \leftarrow s_t + z_t x_t - z_t (x_t \cdot y_t) u.

To apply Theorem 3 to our algorithm we need to evaluate the constants in our bounds.
We are using the same hedging function as in external-regret matching, so the constants
A = 0, B = C = 1/2, and p = q = 2 remain the same, as does the choice of the Euclidean
norm. To determine M we need the longest vector in the augmented \mathcal{Y}. This vector has
length 1.5: the size of the unaugmented \mathcal{Y} is \sqrt{5}/2, and adding a constant component of
1 yields vectors of length up to \sqrt{1 + 5/4} = 1.5. For D we need the squared length of the
largest possible update to s_t. Since z_t x_t has length at most \sqrt{5}/2 and z_t (x_t \cdot y_t) \in [-1, 1],
the update has length at most 1.5, and we can take D = 2.25. Putting all of these values
together, our final bound is

    E(\rho_{t+1}) \le 2.25\sqrt{t}
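For readers who want to see the substitution spelled out, the arithmetic behind this bound, using the Theorem 3 expression with M = 1.5, A = 0, B = C = 1/2, D = 2.25, and p = 2, is:

```latex
\begin{align*}
E(\rho_{t+1}) &\le M \left( \frac{tCD + A}{B} \right)^{1/p}
  = 1.5 \left( \frac{t \cdot \tfrac{1}{2} \cdot 2.25 + 0}{\tfrac{1}{2}} \right)^{1/2} \\
 &= 1.5 \sqrt{2.25\, t} = 1.5 \cdot 1.5 \sqrt{t} = 2.25 \sqrt{t}.
\end{align*}
```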


8.3  Other  applications 

A  large  variety  of  online  prediction  problems  can  be  cast  in  our  framework.  These  problems 
include online convex programming [1,4,5], p-norm perceptrons [2], path planning when
costs are chosen by an adversary [10], planning in a Markov decision process when costs
are chosen by an adversary [14], online pruning of a decision tree [15], and online balancing
of a binary search tree [4]. In each case the bounds provided by the LH algorithm will be
polynomial  in  the  dimensionality  of  the  appropriate  hypothesis  set  and  sublinear  in  the 
number  of  trials.  Rather  than  re-proving  all  of  the  above  results  in  our  framework,  we  will 
illustrate  the  flexibility  of  the  LH  algorithm  by  turning  now  to  a  learning  problem  which 
has  not  previously  been  addressed  in  the  literature:  how  to  learn  to  play  an  extensive-form 
game. 


9  Extensive-form  games 

Extensive-form  games  such  as  poker  or  bridge  are  represented  by  game  trees  with  chance 
moves  and  incomplete  information.  A  behavior  strategy  for  a  player  in  an  extensive-form 
game  is  a  function  which  maps  an  information  state  (or  equivalently  a  history  of  actions 
and  observations)  to  a  distribution  over  available  actions.  The  number  of  distinct  behavior 
strategies  can  be  exponential  in  the  size  of  the  game  tree;  but,  by  using  the  sequence  form 
representation  of  a  game  [16],  we  can  design  algorithms  which  learn  behavior  strategies 
against unknown opponents, achieve O(\sqrt{t}) regret over t trials, and run in polynomial
time.  The  algorithms  described  below  are  the  first  with  all  of  these  properties. 

The regret bounds for our algorithms imply that, in the long run, our learner will
achieve average cost no worse than its safety value, no matter what strategies our opponents
play and without advance knowledge of the payoffs. (Depending on the motivations
of our opponents, we may of course do much better.) The proof of this property is identical
to the one given for matrix games by Freund and Schapire [11]; our work is the first to
demonstrate this property in general extensive-form games.

We assume that our algorithm finds out, after each trial, both its cost y_t \cdot c_t and the
gradient of its cost c_t. Dealing with reduced feedback would be possible, but is beyond
the scope of this paper. (For more information, see for example [17, 18].)
9.1 The sequence form

We want to learn how to act in an extensive-form game through repeated play. To phrase
this task as a structured prediction problem, we can set our feasible set \mathcal{Y} to be the set
\mathcal{Y}^i_{seq} of valid sequence weight vectors for our player. A sequence weight vector y for player
i will contain one sequence weight y_{s^i a^i} for each pair (s^i, a^i), where s^i is an information
state where it is i's turn to move and a^i is one of i's available actions at s^i. All weights
are nonnegative, and the probability of taking action a^i in state s^i is proportional to y_{s^i a^i}.
The set \mathcal{Y} is convex, and the payoff for a strategy y \in \mathcal{Y} is a linear function of y when we
hold the strategies of the other players fixed.

In more detail, we can represent player i's information state just before her kth move
by a sequence of alternating observations and actions, ending in an observation:

    s^i = (z^i_1, a^i_1, z^i_2, a^i_2, ..., z^i_k)

An edge x in the game tree is uniquely identified by the most recent sequences and actions
for all players, x = (s^1, a^1, s^2, a^2, ...).

Player i's policy can be represented by a weight y_{s^i a^i} for each of her state-action pairs
(s^i, a^i), defined as

    y_{s^i a^i} = P(a^i_1 | s^i_1) P(a^i_2 | s^i_2) ... P(a^i_k | s^i_k)    (15)

Here k is the length of s^i, and s^i_j is the subsequence of s^i ending with z^i_j, so for example
s^i_k = s^i. We have written P(a^i_j | s^i_j) for the probability that player i will choose action a^i_j
after having observed s^i_j.

The valid sequence weight vectors satisfy a set of linear constraints: for any state s^i,
the weights y_{s^i a^i} for different actions a^i share all terms in the product (15) except for the
last. So, if we sum these weights, we can factor out the first k - 1 terms and use the fact
that probabilities sum to 1 to get rid of the kth term. If k = 1, there was only one term
in the product to begin with, so we have:

    \sum_{a^i_1} y_{s^i_1 a^i_1} = 1    (16)

On the other hand, for k > 1, the first k - 1 terms in (15) are just a sequence weight from
the (k - 1)st move, so we have:

    \sum_{a^i_k} y_{s^i_k a^i_k} = y_{s^i_{k-1} a^i_{k-1}}


Together with the requirement of nonnegativity, we will write these constraints as

    \mathcal{Y}^i_{seq} = \{ y \ge 0 \mid A^i_{seq} y = b^i_{seq} \}

for a matrix A^i_{seq} and vector b^i_{seq}. Note that the total number of nonzero entries in the
matrices A^i_{seq} and vectors b^i_{seq} for all i is linear in the size of the original game tree. Also note
that any vector y \in \mathcal{Y}^i_{seq} corresponds to a valid strategy for player i: the probability of
choosing action a given history s^i is

    P(a | s^i) = y_{s^i a} / \sum_{a'} y_{s^i a'}
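The correspondence between behavior strategies and sequence weights can be made concrete with a small sketch. Everything named below is hypothetical: information states are encoded as tuples of alternating observations and actions, a policy is a dictionary from state to action probabilities, and the two-stage policy at the bottom is made up for illustration.

```python
def sequence_weight(policy, state, action):
    """Equation (15): product of P(a_j | s_j) along the sequence ending in (state, action).

    `state` is a tuple (z_1, a_1, z_2, a_2, ..., z_k) of alternating observations and actions.
    """
    weight = 1.0
    for j in range(1, len(state), 2):   # positions of the earlier actions a_1, a_2, ...
        prefix = state[:j]              # information state just before that action
        weight *= policy[prefix][state[j]]
    return weight * policy[state][action]


def behavior_probability(y, state, action, actions):
    """Recover P(a | s) = y_{s a} / sum_{a'} y_{s a'} (requires a reachable state)."""
    total = sum(y[(state, a)] for a in actions)
    return y[(state, action)] / total


# Hypothetical two-stage policy; the state and action names are illustrative only.
policy = {
    ("K",): {"pass": 0.25, "bet": 0.75},
    ("K", "pass", "raise"): {"fold": 0.5, "call": 0.5},
}
y = {(s, a): sequence_weight(policy, s, a) for s in policy for a in policy[s]}
```

In this toy policy the weights at the first state sum to 1, and the weights at the second state sum to the weight of the sequence that leads there, which is the pattern of linear constraints described above.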

To conclude this subsection we will show that a player's expected cost is linear in her
sequence weights. Given an edge x in a two-player game tree, determined by the sequence-action
pairs (s^1, a^1) and (s^2, a^2) which the players must play to reach x, the probability
of getting to x is just the product of the conditional probabilities of all of the actions
required to reach x:

    P(x) = P(a^1_1 | s^1_1) P(a^2_1 | s^2_1) P(a^1_2 | s^1_2) P(a^2_2 | s^2_2) ...

If we group together all of the terms for player 1's actions, we get a sequence weight for
player 1, and similarly for player 2:

    P(x) = [P(a^1_1 | s^1_1) P(a^1_2 | s^1_2) ...] [P(a^2_1 | s^2_1) P(a^2_2 | s^2_2) ...]

Similarly, in an n-player game, the probability of reaching the edge

    x = (s^1, a^1, s^2, a^2, ..., s^n, a^n)

is

    P(x) = y_{s^1 a^1} y_{s^2 a^2} ... y_{s^n a^n}

If the cost to player i for traversing edge x is c^i_x, then i's total expected cost is

    \sum_{x \in edges} c^i_x P(x) = \sum_{x = (s^1, a^1, ..., s^n, a^n) \in edges} c^i_x y_{s^1 a^1} y_{s^2 a^2} ... y_{s^n a^n}    (17)
which  is  linear  in  player  i’s  sequence  weights  if  we  hold  the  weights  for  the  other  players 
fixed. 
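A small two-player sketch shows this linearity and how a cost gradient could be read off from it. The data layout is an assumption made for illustration: each edge is a hypothetical triple (player-1 sequence, player-2 sequence, cost), any chance probabilities are assumed to be folded into the edge costs, and the numbers are made up.

```python
def expected_cost(edges, y1, y2):
    """Sum over edges of cost * (player-1 sequence weight) * (player-2 sequence weight)."""
    return sum(cost * y1[sa1] * y2[sa2] for sa1, sa2, cost in edges)


def cost_gradient(edges, y1, y2):
    """Gradient of the expected cost with respect to player 1's sequence weights."""
    grad = {sa: 0.0 for sa in y1}
    for sa1, sa2, cost in edges:
        grad[sa1] += cost * y2[sa2]
    return grad


# Hypothetical toy data: each edge is (player-1 sequence, player-2 sequence, cost).
edges = [("a", "c", 1.0), ("a", "d", 0.0), ("b", "c", 0.5), ("b", "d", 1.0)]
y1 = {"a": 0.5, "b": 0.5}
y2 = {"c": 0.25, "d": 0.75}
c = cost_gradient(edges, y1, y2)
```

With player 2's weights held fixed, the expected cost equals the dot product of the gradient `c` with `y1`, which is what it means for the cost to be linear in player 1's sequence weights.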

9.2  Algorithms 

As noted above, if we are controlling player i, our algorithms will choose strategies y_t \in \mathcal{Y} =
\mathcal{Y}^i_{seq}. They will receive, after each turn, a vector c_t which is the gradient with respect to y_t
of the expected total cost to player i. (We can compute c_t easily by differentiating (17).)
The algorithms will then update their regret vector

    s_t = u \sum_{i=1}^{t-1} y_i^T c_i - \sum_{i=1}^{t-1} c_i    (18)

Here u is a vector with u \cdot y = 1 for all y \in \mathcal{Y}^i_{seq}. For example, u can be zero everywhere
except for 1s in the components s, a corresponding to some initial state s and all actions
a. (Equation (16) guarantees that this choice of u satisfies u \cdot y = 1.)

Given s_t, our algorithms will choose y_t by an optimization involving \mathcal{Y}^i_{seq}, s_t, and a
hedging function W. We can specify different no-regret algorithms by choosing various
hedging functions. Good choices include quadratic and entropy-based hedging functions;
these result in extensive-form versions of the external-regret matching and Hedge algorithms.

For example, the EF external-regret matching algorithm runs as follows: given the
regret vector s_t from (18), solve the optimization problem

    \bar{y} = \arg\max_{y \in \bar{\mathcal{Y}}^i_{seq}} (s_t \cdot y - \|y\|_2^2/2)    (19)

and normalize \bar{y} to get a feasible sequence weight vector y_t \in \mathcal{Y}^i_{seq}. The set \bar{\mathcal{Y}}^i_{seq} can be
written

    \bar{\mathcal{Y}}^i_{seq} = \{ y \ge 0 \mid A^i_{seq} y = \lambda b^i_{seq}, \lambda \ge 0 \}

Since \bar{\mathcal{Y}}^i_{seq} can be described by linear equalities and inequalities, the optimization
problem (19) is a convex quadratic program and can be solved in polynomial time [19].

The EF Hedge algorithm solves instead the optimization problem

    \bar{y} = \arg\max_{y \in \bar{\mathcal{Y}}^i_{seq}} (s_t \cdot y - W_1(y))

where W_1 is defined in Equation (5). Equivalently, we can solve

    maximize    z
    subject to  z \le s_t \cdot y - \sum_i y_i \ln y_i
                \sum_i y_i = 1
                y \in \bar{\mathcal{Y}}^i_{seq}                    (20)

Because this optimization problem is convex, with a polynomial number of linear constraints
and a single simple nonlinear constraint, we can use a number of algorithms
to solve it efficiently starting from a feasible point y^0. (We can get such a y^0 by, e.g.,
renormalizing the sequence weights for the strategy which chooses actions uniformly at
random.) For example, there is a fast separation oracle for the constraints in (20), so we
can find a near-optimal y in polynomial time using the ellipsoid algorithm. Or, for better
practical performance, we could use a log-barrier algorithm such as the one described in
Boyd and Vandenberghe's text [19].

9.3  Regret  bounds 

By evaluating the constants in Theorem 3 we can show regret bounds for the extensive-form
algorithms. The bound for extensive-form external-regret matching is

    E(\rho_{t+1}(y)) \le d\sqrt{td}    (21)

And, the bound for extensive-form Hedge is E(\rho_{t+1}(y)) \le 2dt + d \ln d for \eta = 1; choosing
\eta = \sqrt{(\ln d)/(2t)} yields regret

    E(\rho_{t+1}(y)) \le 2d\sqrt{2t \ln d}    (22)

So,  extensive-form  external-regret  matching  and  extensive-form  Hedge  are  both  no-regret 
algorithms. 

In more detail, the only change in regret bounds when we move from the original Hedge
and external-regret matching algorithms to their extensive-form versions is that, since we
have changed the hypothesis space from the probability simplex to the more complicated
set \mathcal{Y}^i_{seq}, the constants D and M are different.

For the quadratic hedging function, the constants A = 0, B = C = 1/2, and p = q = 2
remain unchanged from the analysis of the original external-regret matching algorithm.
M is the size of a 2-norm ball enclosing \mathcal{Y}^i_{seq}. This constant depends on exactly which
game we are playing, but it is bounded by the dimension d of the sequence weight vector
since each sequence weight is in [0, 1].

The bound D on the size of the regret update depends similarly on exactly which game
we are playing. We will assume that the individual edge costs are in [0, 1] and that the
total cost along any path is no more than 1. The first assumption means that our cost
vector c_t is in [0, 1]^d: according to (17), a sequence weight y_{s^i a^i} affects the total cost only
through terms which correspond to the game tree edges that are consistent with player
i playing the actions specified in s^i and a^i. The weight of y_{s^i a^i} in each of these terms is
the product of the cost of the corresponding edge with the conditional probability that we
will reach the edge given that player i plays her prescribed actions and the other players
follow their given policies. Since these conditional probabilities sum to no more than 1
and since the costs are in [0, 1], the gradient with respect to y_{s^i a^i} will be in [0, 1]. Finally,
u is in [0, 1]^d and y_t \cdot c_t \in [0, 1], so the regret update is in [-1, 1]^d. The 2-norm radius of
[-1, 1]^d is \sqrt{d}, so we can take D = d. Applying Theorem 3 to the above set of constants
yields the bound in Equation (21).
Figure 6: Performance in self-play (left) and against a fixed opponent (right).


For the entropy hedging function, M is the size of a 1-norm ball enclosing \mathcal{Y}, so we can
take M = d. And, D is the size of a max-norm ball enclosing our regret updates, which
is D = 1. The constants A = \ln d, B = 1, C = 1/2, p = 1, and q = \infty remain unchanged
from ordinary Hedge. Applying Theorem 3 to the above set of constants yields the bound
in Equation (22).

10  Experiments 

To  demonstrate  that  our  theoretical  bounds  translate  to  good  practical  performance,  we 
implemented  the  extensive-form  external-regret  matching  algorithm  of  Section  9  and  used 
it  to  learn  policies  for  the  game  of  one-card  poker.  In  one-card  poker,  two  players  (called 
the  gambler  and  the  dealer)  each  ante  $1  and  receive  one  card  from  a  13-card  deck.  The 
gambler  bets  first,  adding  either  $0  or  $1  to  the  pot.  Then  the  dealer  gets  a  chance  to  bet, 
again  either  $0  or  $1.  Finally,  if  the  gambler  bet  $0  and  the  dealer  bet  $1,  the  gambler 
gets  a  second  chance  to  bring  her  bet  up  to  $1.  If  either  player  bets  $0  when  the  other  has 
already  bet  $1,  that  player  folds  and  loses  her  ante.  If  neither  player  folds,  the  higher  card 
wins  the  pot,  resulting  in  a  net  gain  of  either  $1  or  $2  (equal  to  the  other  player’s  ante 
plus  the  bet  of  $0  or  $1).  As  mentioned  earlier,  in  contrast  to  the  usual  practice  in  poker 
we  assume  that  the  payoff  vector  ct  is  observable  after  each  hand;  the  partially-observable 
extension  is  beyond  the  scope  of  this  paper. 
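The payoff rules above can be written as a short function. This is a sketch under stated assumptions rather than the implementation used in the experiments: the function name and argument encoding are hypothetical, cards are taken to be distinct ranks 1 through 13 with the higher rank winning, and the return value is the gambler's net gain in dollars.

```python
def gambler_net_gain(g_card, d_card, g_bet, d_bet, g_calls=False):
    """Net gain in dollars to the gambler for one hand of one-card poker.

    Cards are ranks 1..13 (higher wins); bets are 0 or 1. `g_calls` only matters
    when the gambler bet 0 and the dealer then bet 1.
    """
    if g_card == d_card:
        raise ValueError("cards are assumed to be distinct ranks from one deck")
    if g_bet == 1 and d_bet == 0:
        return 1                      # dealer folds and loses her ante
    if g_bet == 0 and d_bet == 1:
        if not g_calls:
            return -1                 # gambler folds and loses her ante
        g_bet = 1                     # gambler brings her bet up to $1
    stake = 1 + g_bet                 # ante plus the matched bet
    return stake if g_card > d_card else -stake
```

For example, `gambler_net_gain(13, 1, 0, 1, g_calls=True)` models the gambler passing with the top card, the dealer betting, and the gambler calling and winning the showdown.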

One-card  poker  is  a  simple  game;  nonetheless  it  has  many  of  the  elements  of  more 
complicated  games,  including  incomplete  information,  chance  events,  and  multiple  stages. 
And, optimal play requires behaviors like randomization and bluffing. The biggest strategic
difference between one-card poker and larger variants such as draw, stud, or hold-em
is the idea of hand potential: while 45679 and 24679 are almost equally strong hands in a
showdown (they are both 9-high), holding 45679 early in the game is much more valuable
because replacing the 9 with either a 3 or an 8 turns it into a straight.

Figure 7: Minimax one-card poker strategies learned by self-play. Left: gambler bet
probabilities holding different cards. First round in blue, second round in green. Right:
dealer bet probabilities. Probability after hearing gambler pass in blue, after hearing
gambler bet in green.

Figure  6  shows  the  results  of  two  typical  runs:  in  both  panels  the  dealer  is  using  our 
no-regret  algorithm.  In  the  left  panel  the  gambler  is  also  using  our  no-regret  algorithm, 
while  in  the  right  panel  the  gambler  is  playing  a  fixed  policy.  The  x-axis  shows  number  of 
hands  played;  the  y-axis  shows  the  average  payoff  per  hand  from  the  dealer  to  the  gambler. 
The value of the game, -$0.064, is indicated with a dotted line. The middle solid curve
shows  the  actual  performance  of  the  dealer  (who  is  trying  to  minimize  the  payoff). 

The upper curve measures the progress of the dealer's learning: after every fifth hand
we extracted a strategy y^{avg}_t by taking the average of our algorithm's predictions so far.
We then plotted the worst-case value of y^{avg}_t. That is, we plotted the payoff for playing
y^{avg}_t against an opponent which knows y^{avg}_t and is optimized to maximize the dealer's
losses. Similarly, the lower curve measures the progress of the gambler's learning.

In  the  right  panel,  the  dealer  quickly  learns  to  win  against  the  non-adaptive  gambler. 
The  dealer  never  plays  a  minimax  strategy,  as  shown  by  the  fact  that  the  upper  curve 
does  not  approach  the  value  of  the  game.  Instead,  she  plays  to  take  advantage  of  the 
gambler’s  weaknesses.  In  the  left  panel,  the  gambler  adapts  and  forces  the  dealer  to  play 
more  conservatively;  in  this  case,  the  limiting  strategies  for  both  players  are  minimax,  as 
shown  in  Figure  7.  (Note  that  there  are  many  minimax  strategies  for  one-card  poker,  so 
these plots are different from the ones reported in, e.g., [16].)

The curves in the left panel of Figure 6 show an interesting effect: the small, damping
oscillations  result  from  the  dealer  and  the  gambler  “chasing”  each  other  around  a  minimax 
strategy.  One  player  will  learn  to  exploit  a  weakness  in  the  other,  but  in  doing  so  will  open 
up  a  weakness  in  her  own  play;  then  the  second  player  will  adapt  to  try  to  take  advantage 
of  the  first,  and  the  cycle  will  repeat.  Each  weakness  will  be  smaller  than  the  last,  so  the 
sequence  of  strategies  will  converge  to  a  minimax  equilibrium.  This  cycling  behavior  is  a 
common  phenomenon  when  two  learning  players  play  against  each  other.  Many  learning 
algorithms  will  cycle  so  strongly  that  they  fail  to  achieve  the  value  of  the  game,  but  our 
regret  bounds  eliminate  this  possibility. 
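The convergence of average strategies under self-play can be illustrated on a much smaller game than one-card poker. The sketch below is a toy under stated assumptions, not the experiment reported above: it uses external-regret matching (one of the rules the abstract lists as a special case of Lagrangian Hedging) on matching pennies, it updates with expected rather than sampled payoffs, and the function names `regret_matching_strategy` and `self_play` are hypothetical.

```python
def regret_matching_strategy(regret):
    """Play proportionally to positive regret; uniform if none is positive."""
    pos = [max(r, 0.0) for r in regret]
    total = sum(pos)
    n = len(regret)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n


def self_play(payoff, rounds):
    """Two regret-matching learners on a zero-sum matrix game.

    payoff[i][j] is what the column player pays the row player.
    Returns the time-averaged strategies of both players.
    """
    n, m = len(payoff), len(payoff[0])
    row_regret = [1.0] + [0.0] * (n - 1)   # asymmetric start, so play must adapt
    col_regret = [0.0] * m
    row_sum, col_sum = [0.0] * n, [0.0] * m
    for _ in range(rounds):
        p = regret_matching_strategy(row_regret)
        q = regret_matching_strategy(col_regret)
        # Expected payoff of each pure action against the opponent's mixture.
        row_vals = [sum(payoff[i][j] * q[j] for j in range(m)) for i in range(n)]
        col_vals = [sum(payoff[i][j] * p[i] for i in range(n)) for j in range(m)]
        value = sum(p[i] * row_vals[i] for i in range(n))
        for i in range(n):
            row_regret[i] += row_vals[i] - value   # row player maximizes
            row_sum[i] += p[i]
        for j in range(m):
            col_regret[j] += value - col_vals[j]   # column player minimizes
            col_sum[j] += q[j]
    return [x / rounds for x in row_sum], [x / rounds for x in col_sum]


MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
row_avg, col_avg = self_play(MATCHING_PENNIES, 20000)
```

In this toy game the unique minimax strategy for each player is the uniform mixture, so `row_avg` and `col_avg` should both end up close to `[0.5, 0.5]` even though the per-round strategies keep moving.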

11  Discussion  and  related  work 

We  have  presented  the  Lagrangian  Hedging  algorithms,  a  family  of  no-regret  algorithms 
which  can  handle  complex  structure  in  the  set  of  allowable  predictions.  We  have  proved 
regret  bounds  for  LH  algorithms  and  demonstrated  experimentally  that  these  bounds  lead 
to  good  predictive  performance  in  practice.  The  regret  bounds  for  LH  algorithms  have 
low-order dependences on $d$, the number of dimensions in the hypothesis set $\mathcal{Y}$. This
low-order  dependence  means  that  the  LH  algorithms  can  learn  well  in  prediction  problems 
with  complicated  hypothesis  sets;  these  problems  would  otherwise  require  an  impractical 
amount  of  training  data  and  computation  time. 

Our  work  builds  on  previous  work  in  online  learning  and  online  convex  programming. 
Our  contributions  include  a  new,  deterministic  algorithm;  a  simple,  general  proof;  the 
ability  to  build  algorithms  from  a  more  general  class  of  potential  functions;  and  a  new 
way  of  building  good  potential  functions  from  simpler  hedging  functions,  which  allows  us 
to  construct  potential  functions  for  arbitrary  convex  hypothesis  sets.  Future  work  includes 
a  no-internal-regret  version  of  the  LH  algorithm,  as  well  as  a  bandit-style  version.  The 
former will guarantee convergence to a correlated equilibrium in nonzero-sum games, while
the latter will allow us to work from incomplete observations of the cost vector (e.g., as
might happen in an extensive-form game such as poker).

A Proof of main results — I


This  appendix  contains  the  proof  of  Theorem  3.  The  result  as  given  in  Section  7  is  a 
straightforward  combination  of  Theorems  7  and  8,  stated  and  proved  below. 

Our proof proceeds in three steps: first we will prove a general result about gradient
descent (Theorem 5 below) which uses our upper bound on $F$, together with the assumption
that $E(s_{t+1} - s_t)$ never points in the same direction as the gradient of $F$, to bound the
rate of increase of $F(s_t)$. Then we will show that the LH algorithm’s choice of hypothesis
means that $s_{t+1} - s_t$ satisfies our descent assumption. Finally, we will combine the above
results with our lower bound on $F$ to show that $s_t$ itself cannot grow too quickly.

A.1 Bounding the growth of $F(s_t)$

In order to prove our regret bounds we will need our potential function $F$ to have bounded
curvature. More precisely, we will require that there exist a function $f$, a seminorm $\|\cdot\|$,
and a constant $C$ so that Equation (9) on p. 10 holds for all $s$ and $\Delta$.⁸

We also need a condition on our updates to $s_t$: we need them never to point in the
same direction as the gradient of $F(s_t)$. That is, we need

$$E((s_{t+1} - s_t) \cdot f(s_t) \mid s_t) \le 0 \tag{23}$$

We will call Equation (23) the generalized Blackwell condition since it is similar to one
of the conditions of Blackwell’s approachability theorem [3]. Our first theorem proves a
general bound on the growth rate of $F(s_t)$ using conditions (9) and (23).

Theorem 5 (Gradient descent) Let $F(s)$ and $f(s)$ satisfy Equation (9) using the seminorm
$\|\cdot\|$ and the constant $C$. Let $x_0, x_1, \ldots$ be any sequence of random vectors. Write
$s_t = \sum_{u=0}^{t-1} x_u$ and let $E(\|x_t\|^2 \mid s_t) \le D$ for some constant $D$. Suppose that, for all $t$,
$E(x_t \cdot f(s_t) \mid s_t) \le 0$. Then for all $t$,

$$E(F(s_{t+1}) \mid s_1) - F(s_1) \le tCD$$

PROOF: The proof is by induction. For $t = 0$ we have

$$F(s_1) - F(s_1) \le 0$$

For $t \ge 1$, assume that

$$E(F(s_t) \mid s_1) \le F(s_1) + (t-1)CD$$

⁸ The text around Equation (9) specifies that $F$ is convex and that $\|\cdot\|$ is a finite norm, but Theorem 5
holds in the more general case when $F$ may be non-convex and $\|s\|$ may be $\infty$ or 0. If $\|\cdot\|$ is a norm and
$F$ is convex (as will be the case in our application of Theorem 5 below), then Equation (9) implies that $F$
is differentiable everywhere and that $f$ is its gradient.

Then:


$$\begin{aligned}
F(s_{t+1}) &= F(s_t + x_t) \\
F(s_{t+1}) &\le F(s_t) + x_t \cdot f(s_t) + C\|x_t\|^2 \\
E(F(s_{t+1}) \mid s_t) &\le F(s_t) + CD \\
E(F(s_{t+1}) \mid s_1) &\le E(F(s_t) \mid s_1) + CD \\
E(F(s_{t+1}) \mid s_1) &\le F(s_1) + (t-1)CD + CD
\end{aligned}$$

which is the desired result. The first line above follows from the definition of $s_{t+1}$; the
second, from Equation (9); the third, from taking $E(\cdot \mid s_t)$ on both sides, then using
the generalized Blackwell condition and our assumption about $\|x_t\|$ to bound the last
two terms; the fourth, from taking $E(\cdot \mid s_1)$ on both sides and using the law of iterated
expectations; and the last, from the inductive hypothesis. □
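Theorem 5 can be sanity-checked numerically in a deterministic special case. The sketch below assumes the potential $F(s) = \|s\|_2^2/2$ with $f(s) = s$, for which the curvature bound used in the second line of the proof holds with equality for the Euclidean norm and $C = 1/2$. The increment generator `blackwell_increment` and the other names are hypothetical, chosen only for illustration.

```python
import random


def potential(s):
    """F(s) = ||s||^2 / 2, so f(s) = s and the curvature constant is C = 1/2."""
    return 0.5 * sum(v * v for v in s)


def blackwell_increment(s, rng, max_sq_norm):
    """Random x with x . s <= 0 and ||x||^2 <= max_sq_norm."""
    x = [rng.uniform(-1.0, 1.0) for _ in s]
    dot = sum(a * b for a, b in zip(x, s))
    if dot > 0:  # flip the increment so it does not point along the gradient
        x = [-a for a in x]
    sq = sum(a * a for a in x)
    if sq > max_sq_norm:
        scale = (max_sq_norm / sq) ** 0.5
        x = [a * scale for a in x]
    return x


def check_growth_bound(dim=5, steps=200, D=1.0, C=0.5, seed=0):
    """Return t*C*D - (F(s_{t+1}) - F(s_1)) for t = 1..steps."""
    rng = random.Random(seed)
    s = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # s_1 = x_0, arbitrary
    f_start = potential(s)
    gaps = []
    for t in range(1, steps + 1):
        x = blackwell_increment(s, rng, D)
        s = [a + b for a, b in zip(s, x)]             # s_{t+1} = s_t + x_t
        gaps.append(t * C * D - (potential(s) - f_start))
    return gaps


gaps = check_growth_bound()
```

Every entry of `gaps` should be nonnegative, which is the statement of the theorem with the expectations removed.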

A.2 The expected change in $s_t$

We would like to apply Theorem 5 to bound the regret of the Lagrangian Hedging algorithm.
To do so, we need to show that the LH algorithm produces a sequence of regret
vectors $s_t$ that satisfies the necessary assumptions. We have already assumed, in Equation (12),
that $E(\|s_{t+1} - s_t\|^2 \mid s_t) \le D$. So, we only need to prove that the sequence $s_t$
satisfies the generalized Blackwell condition, Equation (23). The following lemma does so:

Lemma 6 The Lagrangian Hedging algorithm produces a sequence of regret vectors $s_t$
which satisfies

$$E((s_{t+1} - s_t) \cdot f_t \mid s_t) \le 0$$

for all $t$, where $f_t \in \partial F(s_t)$.

Proof: We will choose $f_t$ to be equal to the variable $\bar{y}_t$ from Figure 2. This choice means
that the variable $y_t$ from Figure 2 satisfies $k y_t = f_t$ where $k = (\bar{y}_t \cdot u) \ge 0$: in the then
clause of Figure 2 we have $\bar{y}_t \cdot u > 0$ so we can just multiply through. In the else clause,
$\bar{y}_t \cdot u = 0$. This means $\bar{y}_t = 0$: since $\bar{y}_t \in \bar{\mathcal{Y}}$, we can write $\bar{y}_t = \lambda y$ for some $y \in \mathcal{Y}$ and
$\lambda \ge 0$. Dotting with $u$ gives us

$$u \cdot \bar{y}_t = \lambda\, u \cdot y$$

or

$$0 = \lambda$$

since $u \cdot y = 1$ for any $y \in \mathcal{Y}$ by the definition of $u$. So, $\bar{y}_t = 0 = k y_t$.

Now, Equation (1) tells us that the expected change in the regret vector is

$$E(s_{t+1} - s_t \mid s_t) = (c_t \cdot y_t) u - c_t$$

where $c_t$ is chosen by the opponent but must be independent of $y_t$. Taking the dot product
with $y_t$ yields

$$E((s_{t+1} - s_t) \cdot y_t \mid s_t) = (c_t \cdot y_t)(u \cdot y_t) - c_t \cdot y_t = 0$$

since $u \cdot y_t = 1$. Note that this expected value does not depend on $c_t$: the opponent can’t
influence the expected component of $s_{t+1} - s_t$ along $y_t$. Multiplying both sides by $k$ and
using the identity $k y_t = f_t$ inside the expectation, we have

$$E((s_{t+1} - s_t) \cdot f_t \mid s_t) = 0$$

which proves the desired result. □
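The cancellation at the heart of this proof is easy to check numerically. The sketch below assumes the hypothesis set is the probability simplex, so that $u$ can be taken to be the all-ones vector and $u \cdot y = 1$ for every hypothesis; the cost vectors are random and the helper names are hypothetical.

```python
import random


def expected_regret_change(cost, y, u):
    """(c . y) u - c, the expected change in the regret vector."""
    cy = sum(c * v for c, v in zip(cost, y))
    return [cy * ui - ci for ui, ci in zip(u, cost)]


def random_simplex_point(dim, rng):
    """A strictly positive vector whose entries sum to one."""
    w = [rng.random() + 1e-3 for _ in range(dim)]
    total = sum(w)
    return [v / total for v in w]


rng = random.Random(1)
dim = 6
u = [1.0] * dim                      # u . y = 1 for every y on the simplex
components = []
for _ in range(100):
    y = random_simplex_point(dim, rng)
    cost = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    delta = expected_regret_change(cost, y, u)
    # Component of the expected regret change along the played hypothesis.
    components.append(sum(d * v for d, v in zip(delta, y)))
```

Whatever cost vector is drawn, each entry of `components` is zero up to rounding error, matching the observation that the opponent cannot influence the expected component of the regret change along $y_t$.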

A.3 Bounds on the gradient form

In  addition  to  the  upper  bounds  in  Equation  (9),  we  will  need  a  lower  bound  on  the  growth 
of  F(s)  as  s  gets  far  away  from  the  safe  set  S:  without  such  a  bound,  we  would  be  able  to 
show  that  F(st)  doesn’t  grow  too  fast,  but  we  would  not  be  able  to  translate  that  result 
to  a  bound  on  st  itself. 

Depending on how strong a lower bound we can prove on $F$, we will get different results
about the regret of our algorithm. The strongest results (showing that our average regret
decreases as $O(1/\sqrt{t})$) will hold if we can show a quadratic lower bound on $F$. The bounds
will get progressively weaker as our bounds on $F$ get looser, until the weakest possible
lower bound on $F$ (a linear growth rate) gives us the weakest possible upper bound on
regret. (Adjusting our learning rate, as described below in Section A.4, will allow us to
improve some of these bounds.)

To collect all of these results into a single theorem, we will parameterize our lower
bound on $F$ by an exponent $1 \le p \le 2$, as shown in Equation (10) on p. 10. To make (10)
be a non-vacuous lower bound, we will require $\|\cdot\|$ to be a norm rather than a seminorm.
(That is, we will require $(\|x\| = 0) \Leftrightarrow (x = 0)$. Note that $\|\cdot\|$ must be finite since $F$ is
finite.) With our lower bound we have the following theorem:

Theorem 7 Suppose the potential function $F$ is convex and satisfies Equations (4), (9),
and (10) for constants $A$, $B$, $C$ and $p$ and a norm $\|\cdot\|$. Suppose that the problem definition
is bounded according to (11) and (12) for constants $M$ and $D$. Then the LH algorithm
(Figure 2) achieves expected regret

$$E(\rho_{t+1}(y)) \le M((tCD + A)/B)^{1/p} = O(t^{1/p})$$

versus any hypothesis $y \in \mathcal{Y}$.

PROOF: Equations (9) and (12) together with Lemma 6 show that $F$, $f$, and the update
$s_{t+1} - s_t$ satisfy the assumptions of Theorem 5. So,

$$E(F(s_{t+1}) \mid s_1) - F(s_1) \le tCD$$

Since $s_1$ is a fixed constant we can discard the conditioning, and since $s_1 \in \mathcal{S}$ we have
$F(s_1) \le 0$ by Equation (4). So,

$$E(F(s_{t+1})) \le tCD$$

Since $F$ is convex, Jensen’s inequality tells us that $F(E(s_{t+1})) \le E(F(s_{t+1}))$. So, writing
$\bar{s} = E(s_{t+1})$, we have

$$F(\bar{s}) \le tCD$$

Adding $A$ on both sides and using the fact that $tCD + A \ge 0$, we also have

$$[F(\bar{s}) + A]_+ \le tCD + A$$

Now, applying (10) shows that

$$B \inf_{s \in \mathcal{S}} \|\bar{s} - s\|^p \le tCD + A \tag{24}$$

The function $x^{1/p}$ is monotone on $\mathbb{R}_+$; so, we can apply it to both sides of Equation (24)
and then move it inside the inf operator on the left-hand side:

$$B^{1/p} \inf_{s \in \mathcal{S}} \|\bar{s} - s\| \le (tCD + A)^{1/p} \tag{25}$$

Now pick any $y \in \mathcal{Y}$ and $s \in \mathcal{S}$. Our expected regret versus $y$ is

$$E(\rho_{t+1}(y)) = \bar{s} \cdot y \le (\bar{s} - s) \cdot y$$

since $s \cdot y \le 0$. So, for any $y \in \mathcal{Y}$ and $s \in \mathcal{S}$,

$$E(\rho_{t+1}(y)) \le (\bar{s} - s) \cdot y \le \|\bar{s} - s\|\,\|y\|_\circ \le M\|\bar{s} - s\| \tag{26}$$

by Hölder’s inequality and bound (11). Since $s \in \mathcal{S}$ was arbitrary, we will pick the $s$ which
makes our bound tightest:

$$E(\rho_{t+1}(y)) \le M \inf_{s \in \mathcal{S}} \|\bar{s} - s\|$$

Finally, substituting in Equation (25) gives us

$$E(\rho_{t+1}(y)) \le M((tCD + A)/B)^{1/p}$$

which is the desired result. □

A.4 Adjusting the learning rate

Theorem 7 shows that the LH algorithm is no-regret so long as $p > 1$. Some algorithms
(for example, weighted majority) need $p = 1$ in their analysis; when $p = 1$, we can use
the standard trick of an adjustable learning rate, together with prior knowledge of the
number of trials, to achieve regret which is sublinear in $t$. For generality we will calculate
the effect of adjusting the learning rate for $1 \le p < 2$, although in practice the $p = 1$ case
is the most important.

As described in Section 5, we can add a learning rate $\eta$ to the LH algorithm by replacing
$F(s)$ with $G(s) = F(\eta s)$. If $F$ satisfies Equations (9) and (10) with constants $A$, $B$, $C$,
and $p$, then $G$ satisfies them as well but with different constants: since $\partial G(s) = \eta\,\partial F(\eta s)$,
writing $g(s) = \eta f(\eta s)$ we have

$$\begin{aligned}
G(s + x) &= F(\eta s + \eta x) \\
&\le F(\eta s) + \eta x \cdot f(\eta s) + C\|\eta x\|^2 \\
&= G(s) + x \cdot g(s) + \eta^2 C\|x\|^2
\end{aligned}$$

And, since $\eta s' \in \mathcal{S} \Leftrightarrow s' \in \mathcal{S}$,

$$\begin{aligned}
[G(s) + A]_+ &= [F(\eta s) + A]_+ \\
&\ge \inf_{s' \in \mathcal{S}} B\|\eta s - s'\|^p \\
&= \inf_{s' \in \mathcal{S}} B\|\eta s - \eta s'\|^p \\
&= \inf_{s' \in \mathcal{S}} \eta^p B\|s - s'\|^p
\end{aligned}$$

So, using a learning rate $\eta$ changes the constants for Equations (9) and (10) according to
$A \mapsto A$, $B \mapsto \eta^p B$, $C \mapsto \eta^2 C$, and $p \mapsto p$. By setting $\eta$ to optimize these constants we can
now prove the following theorem:

Theorem 8 Suppose that $F$ is convex and satisfies Equations (9) and (10) with constants
$A$, $B$, $C$, and $1 \le p < 2$ and the norm $\|\cdot\|$. Suppose our problem definition has constants
$M$ and $D$ in Equations (11) and (12). Let $t$ be the anticipated number of trials, and define
$G(s) = F(\eta s)$, where

$$\eta = \sqrt{pA/(tCD(2 - p))}$$

Then the LH algorithm with potential $G$ achieves regret $O(\sqrt{t})$. In particular, if $p = 1$, we
have

$$\eta = \sqrt{A/(tCD)}$$

and

$$E(\rho_{t+1}(y)) \le (2M/B)\sqrt{tACD}$$

for any hypothesis $y \in \mathcal{Y}$.

PROOF: Theorem 7 shows


$$E(\rho_{t+1}(y)) \le M((t\eta^2 CD + A)/(\eta^p B))^{1/p}$$

or equivalently

$$E(\rho_{t+1}(y)) \le M((t\eta^{2-p} CD + A\eta^{-p})/B)^{1/p} \tag{27}$$

Minimizing the above bound with respect to $\eta$ is equivalent to solving

$$\frac{d}{d\eta}\left[t\eta^{2-p} CD + A\eta^{-p}\right] = 0$$

Since $0 < p < 2$, differentiating yields

$$(2 - p)\,t\eta^{1-p} CD = pA\eta^{-p-1}$$

and therefore

$$\eta^2 = pA/(tCD(2 - p))$$

which is the learning rate given in the theorem. Substituting this value of $\eta$ back into our
bound gives

$$\begin{aligned}
E(\rho_{t+1}(y)) &\le M((t\eta^2 CD + A)/B)^{1/p}/\eta \\
&= M((pA/(2 - p) + A)/B)^{1/p}\sqrt{tCD(2 - p)/(pA)} \\
&= O(\sqrt{t})
\end{aligned}$$

as required. When $p = 1$, the learning rate simplifies to $\sqrt{A/(tCD)}$ and the regret bound
simplifies to $(2M/B)\sqrt{tACD}$. □
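The optimization in this proof can be checked with a few lines of arithmetic. The sketch below evaluates the right-hand side of Equation (27) at the learning rate from Theorem 8 and at nearby values. The function names `lh_learning_rate` and `regret_bound` are hypothetical, and the constants are illustrative assumptions rather than values from the experiments.

```python
import math


def lh_learning_rate(p, A, t, C, D):
    """eta = sqrt(p A / (t C D (2 - p))), for 1 <= p < 2 and A > 0."""
    if not (1.0 <= p < 2.0) or A <= 0:
        raise ValueError("need 1 <= p < 2 and A > 0")
    return math.sqrt(p * A / (t * C * D * (2.0 - p)))


def regret_bound(eta, p, A, B, C, D, M, t):
    """Right-hand side of Equation (27) for a given learning rate."""
    inner = (t * eta ** (2.0 - p) * C * D + A * eta ** (-p)) / B
    return M * inner ** (1.0 / p)


# Illustrative constants only (assumed, not taken from the paper's experiments).
params = dict(p=1.0, A=math.log(10), B=1.0, C=0.5, D=4.0, M=1.0, t=10_000)
eta = lh_learning_rate(params["p"], params["A"], params["t"], params["C"], params["D"])
best = regret_bound(eta, **params)
closed_form = (2.0 * params["M"] / params["B"]) * math.sqrt(
    params["t"] * params["A"] * params["C"] * params["D"]
)
```

With $p = 1$ the value `best` agrees with the closed form $(2M/B)\sqrt{tACD}$, and perturbing `eta` in either direction gives a larger bound.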

Note that in order to achieve sublinear regret for $p = 1$ we needed advance knowledge of
the number of trials.⁹ This sort of dependence on $p$ is typical of results in the literature:
when our potential function is superlinear the algorithm can in effect choose its own
learning rate, while if the potential is merely linear in some direction leading away from
$\mathcal{S}$ we need to select a learning rate based on external knowledge.

As $A \downarrow 0$, the recommended learning rate gets smaller and smaller. If $A$ were 0 the
recommendation would be $\eta = 0$, which seems like a contradiction. But, it is not possible
to have $p < 2$ and $A = 0$: take $\lambda > 0$ and $\Delta \notin \mathcal{S}$ with $f(0) \cdot \Delta \le 0$. (Since $f(0) \in \bar{\mathcal{Y}}$, such
a $\Delta$ always exists: $\mathcal{S}$ is contained in any halfspace whose normal is in $\bar{\mathcal{Y}}$, and since we
have assumed $\mathcal{Y}$ has at least two distinct elements the containment must be strict.) Then
Equation (9) at regret vector $s = 0$ and increment $\lambda\Delta$ requires

$$F(\lambda\Delta) \le F(0) + \lambda\Delta \cdot f(0) + C\|\lambda\Delta\|^2 \le C\|\lambda\Delta\|^2$$

since $F(0) \le 0$ (because $0 \in \mathcal{S}$). And, Equation (10) requires

$$F(\lambda\Delta) \ge B \inf_{s' \in \mathcal{S}} \|\lambda\Delta - s'\|^p$$

These two bounds are inconsistent: combined, they require

$$\lambda^2 \cdot \text{constant} \ge \lambda^p \cdot \text{constant}$$

with both constants strictly positive, which cannot hold as $\lambda \downarrow 0$ since $p < 2$.

⁹ At the cost of some complexity we could have used a decreasing sequence of learning rates to sidestep
this requirement.

As $p \uparrow 2$, the recommended learning rate gets larger and larger. If $p = 2$, the recommended
learning rate will be $\eta = \infty$ (unless $A = 0$, in which case Equation (27) is
independent of $\eta$): while the analysis in the proof of Theorem 8 doesn’t apply, it is easy
to see that increasing $\eta$ doesn’t alter the ratio $C/B$ in Equation (27) and decreases $A/B$,
thereby improving the bound. In practice, if $p$ is near or equal to 2 and $A > 0$, we would
recommend setting $\eta$ as large as is practical.

B  Proof  of  main  results — II 

Theorems 7 and 8 bound the regret of the gradient form of the LH algorithm in terms
of properties of $F$. For the optimization form we are not given the potential function
$F$ directly, so we cannot check the conditions of these theorems. Instead we define $F$ in
terms of the hedging function $W$ using Equation (7). Unlike $F$, there is no need for $W$ to
be differentiable, so long as it satisfies the required assumptions.

In  this  section  we  describe  how  to  transfer  bounds  on  the  hedging  function  W  to  the 
potential  function  F.  An  upper  bound  on  W  leads  to  a  lower  bound  on  F,  while  a  lower 
bound  on  W  yields  an  upper  bound  on  F.  The  ability  to  transfer  bounds  means  that, 
when  we  analyze  or  implement  the  optimization  form  of  the  LH  algorithm,  we  never  have 
to  evaluate  the  potential  function  F  or  its  derivative  explicitly. 

Our  bounds  on  W  are  detailed  above,  in  Section  7.  With  these  bounds  on  W,  we  can 
prove  the  required  bounds  on  the  potential  function  F: 

Theorem 9 Suppose that the hedging function $W$ is closed, convex, nonnegative, and
satisfies Equations (13) and (14) with the constants $A$, $B$, $C$, and $2 \le q \le \infty$ and the
finite norm $\|\cdot\|_\circ$. Suppose the set $\bar{\mathcal{Y}} \cap \operatorname{rel\,int} \operatorname{dom} W$ is nonempty. Define $p$ so that
$\frac{1}{p} + \frac{1}{q} = 1$. Then the function $F$ defined by Equation (7) is closed and convex and satisfies
Equations (4), (9), and (10) with constants $A$, $B$, $C$, and $p$ and norm $\|\cdot\|$.

Since $W$ and related functions may not be differentiable, we will use the notation of
convex analysis to prove our bounds; see Appendix E for definitions. In this notation
Equation (7) is equivalent to

$$F = (I_{\bar{\mathcal{Y}}} + W)^* \tag{28}$$

Figure 8: Illustration of how to transfer bounds between a function and its dual. On the
left, the negentropy function and a quadratic lower bound; on the right, $\ln(1 + e^x)$ and
the dual quadratic upper bound.

Here $I_{\bar{\mathcal{Y}}}$ represents the feasible region of the optimization in (7), while $W$ is the nonlinear
part of the objective. (The linear part of the objective corresponds to the argument of $F$.)
By moving the duality operator inside the parentheses, we can see that Equation (28) is
also equivalent to

$$F = I_{\mathcal{S}} \,\square\, W^* \tag{29}$$

since infimal convolution is dual to addition and $I_{\mathcal{S}}$ is dual to $I_{\bar{\mathcal{Y}}}$.

Our bounds on $F$ follow from the simple observation that the duality operator reverses
inequalities between functions, as illustrated in Figure 8: for closed convex functions $F$
and $G$, if $F^*(y) \ge G^*(y)$ for all $y$, then $G(s) \ge F(s)$ for all $s$. This fact is a direct
consequence of the definition of duality:

$$G(s) = \sup_y\,[s \cdot y - G^*(y)] \ge \sup_y\,[s \cdot y - F^*(y)] = F(s) \tag{30}$$

where the inequality holds because substituting $F^*$ for $G^*$ reduces the expression in square
brackets at every value of $y$, and therefore reduces the supremum.
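The inequality reversal in (30) can be seen on a one-dimensional example. The sketch below approximates the dual by maximizing over a finite grid, which is an assumption made for illustration (the true dual takes a supremum over all $y$). The pair $F^*(y) = y^2$ and $G^*(y) = y^2/2$ is chosen because the exact duals, $F(s) = s^2/4$ and $G(s) = s^2/2$, are known in closed form. The helper name `conjugate_on_grid` is hypothetical.

```python
def conjugate_on_grid(func, grid, s):
    """Approximate sup_y [s*y - func(y)] by maximizing over a finite grid."""
    return max(s * y - func(y) for y in grid)


def f_star(y):
    return y * y            # F*(y) = y^2


def g_star(y):
    return 0.5 * y * y      # G*(y) = y^2 / 2, so G*(y) <= F*(y) everywhere


grid = [i / 100.0 for i in range(-500, 501)]     # y in [-5, 5]
points = [i / 10.0 for i in range(-20, 21)]      # s in [-2, 2]
F = [conjugate_on_grid(f_star, grid, s) for s in points]
G = [conjugate_on_grid(g_star, grid, s) for s in points]
```

Because $F^* \ge G^*$ pointwise, the computed values satisfy `G[k] >= F[k]` at every `points[k]`, and on this grid they match the closed forms up to rounding.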

We can use the inequality (30) almost directly to turn our upper bound on $W$ into a
lower bound on $F$: all we will need to do in our proof below is add $I_{\bar{\mathcal{Y}}}$ to both sides of
Equation (14) and take the dual. To prove our upper bound on $F$, on the other hand,
requires a slightly more complicated argument.

Returning to Figure 8, notice that the bound on the left is tangent to $F^*(y)$ at the
input $y_0 = 0.7$ with slope $s_0 \approx 0.85$, while the dual bound on the right is tangent to
$F(s)$ at the input $s_0$ with slope $y_0$. This sort of correspondence holds in general: the slope
of a function translates to the argument of its dual, and vice versa. So, if we start with a
lower bound on $F^*$ of the form we could imagine deriving from Equation (13)

$$F^*(y) \ge F^*_{\mathrm{bound}}(y) = F^*(y_0) + (y - y_0) \cdot s_0 + L\|y - y_0\|_\circ^2/2 \tag{31}$$

which is tangent at $y = y_0$ with slope $s_0$, we end up with an upper bound on $F(s)$ which
is tangent at $s = s_0$. To prove (9), we need to produce bounds on $F$ which are tangent at
every possible input $s_0$; so, we need to start from bounds on $F^*$ which have every possible
slope $s_0$ at their tangent points. The proof below demonstrates how to construct such
bounds from Equation (13).

Proof (of Theorem 9): It is immediate that $F$ is closed and convex, since $F$ is defined
as the dual of another function and the output of the duality operator is always closed
and convex. Equation (4) is also immediate: in (7),

$$s \cdot y - W(y) \le s \cdot y$$

since $W(y) \ge 0$; so, since $s \cdot y \le 0$ for all $s \in \mathcal{S}$ and $y \in \bar{\mathcal{Y}}$, $F(s) \le 0$ for all $s \in \mathcal{S}$.

Let us now prove the lower bound on $F$, Equation (10). We have assumed (Equation (14))
that

$$\operatorname{conv}\min(W(y) - A + I_{\bar{\mathcal{Y}}}(y),\ I_0(y)) \le B\|y/B\|_\circ^q \quad \forall y \in \bar{\mathcal{Y}}$$

Adding $I_{\bar{\mathcal{Y}}}$ to both sides yields

$$\operatorname{conv}\min(W(y) - A + I_{\bar{\mathcal{Y}}}(y),\ I_0(y)) \le B\|y/B\|_\circ^q + I_{\bar{\mathcal{Y}}}(y) \tag{32}$$

The left-hand side was already infinite for $y \notin \bar{\mathcal{Y}}$, so adding $I_{\bar{\mathcal{Y}}}$ had no effect. Note that
we have dropped the qualifier $\forall y \in \bar{\mathcal{Y}}$ since (32) is clearly true if $y \notin \bar{\mathcal{Y}}$.

We will next take duals on both sides of (32). For any two functions $X$ and $Y$, the
dual of $\operatorname{conv}\min(X, Y)$ is $\max(X^*, Y^*)$ and the dual of $X + Y$ is $X^* \,\square\, Y^*$. The dual of
the indicator function for a cone is the indicator function for the dual cone; for example,
the dual of $I_0$ is $I_{\mathbb{R}^d} = 0$. So, writing $s$ for the dual variable, we have

$$\max((W(y) - A + I_{\bar{\mathcal{Y}}}(y))^*(s),\ 0) \ge (B\|y/B\|_\circ^q)^*(s) \,\square\, I_{\mathcal{S}}(s)$$

Since $F^* = W + I_{\bar{\mathcal{Y}}}$, we can simplify the first argument of the max:

$$\max(F(s) + A,\ 0) \ge (B\|y/B\|_\circ^q)^*(s) \,\square\, I_{\mathcal{S}}(s)$$

The dual of $\|\cdot\|_\circ^q$ is $\|\cdot\|^p$, and for any function $X$ the dual of $aX(y)$ is $aX^*(s/a)$, so the
dual of $B\|y/B\|_\circ^q$ is $B\|s\|^p$. That gives us

$$\max(F(s) + A,\ 0) \ge B\|s\|^p \,\square\, I_{\mathcal{S}}(s)$$

which is equivalent to Equation (10) as desired.

For the upper bound on $F$, Equation (9), we can use some simple identities to compute
the dual of a function of the form $F^*_{\mathrm{bound}}$ given in (31): first, the dual of any multiple of
a squared norm is a multiple of the squared dual norm.

$$(L\|\cdot\|_\circ^2/2)^* = (1/L)\|\cdot\|^2/2 \tag{33}$$

Second, adding a linear function to an arbitrary convex function $G$ just shifts the dual of
$G$ without changing its basic shape:

$$(a \cdot s + b + G(s))^* = G^*(y - a) - b \tag{34}$$

Finally, if we have a point $(y_0, G^*(y_0))$ where there is a tangent to $G^*$ of slope $s_0$, then
the function $G^*(y) - y \cdot s_0$ has a tangent of slope 0 at $y = y_0$. So, $y_0$ is a minimum of
$G^*(y) - y \cdot s_0$, and

$$G(s_0) = \sup_y\,(y \cdot s_0 - G^*(y)) = y_0 \cdot s_0 - G^*(y_0) \tag{35}$$

Combining the identities (33) and (34), the dual of

$$L\|y\|_\circ^2/2 + s_0 \cdot y + F^*(y_0)$$

is

$$(1/L)\|s - s_0\|^2/2 - F^*(y_0)$$

Using Equation (34) again (in the opposite direction) for the substitution $y \mapsto (y - y_0)$
tells us that the dual of $F^*_{\mathrm{bound}}$ is

$$F_{\mathrm{bound}}(s) = (1/L)\|s - s_0\|^2/2 - F^*(y_0) + s \cdot y_0$$

Adding and subtracting $s_0 \cdot y_0$ and using (35) gives us

$$F_{\mathrm{bound}}(s) = F(s_0) + (s - s_0) \cdot y_0 + (1/L)\|s - s_0\|^2/2 \tag{36}$$

As mentioned above in the main text, $F_{\mathrm{bound}}(s_0) = F(s_0)$ and by Equation (30) we have
$F_{\mathrm{bound}}(s) \ge F(s)$ for all $s$. So, to prove our result we need to be able to construct an
appropriate $F^*_{\mathrm{bound}}$ from Equation (13) for any desired slope $s_0$.

First we will show that $F$ must be finite everywhere. We have assumed that there
exists a point $y_0 \in \bar{\mathcal{Y}} \cap \operatorname{rel\,int} \operatorname{dom} W \subseteq \operatorname{dom} \partial W$. Write $s_0$ for an arbitrary element of
$\partial W(y_0)$. Now Equation (13) tells us that

$$\begin{aligned}
F(s) &= \sup_{y \in \bar{\mathcal{Y}}}\,(y \cdot s - W(y)) \\
&\le \sup_{y \in \bar{\mathcal{Y}}}\,(y \cdot s - W(y_0) - s_0 \cdot (y - y_0) - (1/4C)\|y - y_0\|_\circ^2) \\
&< \infty
\end{aligned}$$

because the expression inside the supremum is bounded above (along every line through
$y_0$ it is concave and quadratic, with bounded slope at $y_0$).

Since $F$ is finite everywhere, $\partial F$ is nonempty everywhere. So, given a desired $s_0$, pick
$y_0 \in \partial F(s_0)$; we will build an $F^*_{\mathrm{bound}}$ function of the form given in Equation (31) using
this choice of $y_0$.

By duality we have $s_0 \in \partial F^*(y_0)$. Since $F^* = I_{\bar{\mathcal{Y}}} + W$ we have $s_0 = s_1 + s_2$ with
$s_1 \in \partial I_{\bar{\mathcal{Y}}}(y_0)$ and $s_2 \in \partial W(y_0)$ by Theorem 23.8 of [20, p. 223]. Theorem 23.8 applies
because we have assumed that $\bar{\mathcal{Y}} \cap \operatorname{rel\,int} \operatorname{dom} W$ is nonempty.

The existence of $s_1$ tells us that $y_0 \in \bar{\mathcal{Y}}$, and similarly the existence of $s_2$ tells us that
$y_0 \in \operatorname{dom} \partial W$. So by assumption Equation (13) holds for $y_0$ and $s_2$:

$$W(y) \ge W(y_0) + (y - y_0) \cdot s_2 + (1/4C)\|y - y_0\|_\circ^2 \quad \forall y$$

And by definition of subgradient,

$$I_{\bar{\mathcal{Y}}}(y) \ge I_{\bar{\mathcal{Y}}}(y_0) + (y - y_0) \cdot s_1 \quad \forall y$$

Adding these two inequalities yields

$$F^*(y) \ge F^*(y_0) + (y - y_0) \cdot s_0 + (1/4C)\|y - y_0\|_\circ^2 \quad \forall y \tag{37}$$

Picking $L = 1/(2C)$, we can identify Equation (37) with Equation (31). So, taking the dual
of both sides, we have

$$F(s) \le F(s_0) + (s - s_0) \cdot y_0 + C\|s - s_0\|^2 \quad \forall s$$

as we derived in Equation (36). Since $s_0$ was arbitrary, we have now shown that $F$
satisfies (9), which finishes the proof of our theorem. □

C  Additional  proofs 

In  this  section  we  will  prove  that  the  two  forms  of  the  LH  algorithm  are  well-defined  and 
that  the  optimization  form  is  a  special  case  of  the  gradient  form. 

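Proof (of Theorem 1): Define $\bar{\mathcal{Y}}$ as in Equation (3). If we can show that $\bar{y}_t \in \bar{\mathcal{Y}}$ then
we are done: if $\bar{y}_t = \lambda y$, then $\bar{y}_t \cdot u = \lambda$. Either $\lambda > 0$, in which case the then clause in
Figure 2 will pick $y_t = y \in \mathcal{Y}$, or $\lambda = 0$, in which case the else clause will pick $y_t \in \mathcal{Y}$.

By convexity, since $\bar{y}_t \in \partial F(s_t)$,

$$F(s) \ge F(s_t) + (s - s_t) \cdot \bar{y}_t$$

For all $s \in \mathcal{S}$ we have $F(s) \le 0$, so

$$0 \ge F(s_t) + (s - s_t) \cdot \bar{y}_t \quad \forall s \in \mathcal{S}$$

or, rearranging terms

$$s_t \cdot \bar{y}_t - F(s_t) \ge s \cdot \bar{y}_t \quad \forall s \in \mathcal{S}$$

Since $\alpha s \in \mathcal{S}$ for all $\alpha > 0$, we also have

$$\begin{aligned}
s_t \cdot \bar{y}_t - F(s_t) &\ge \alpha s \cdot \bar{y}_t \\
(s_t \cdot \bar{y}_t - F(s_t))/\alpha &\ge s \cdot \bar{y}_t \\
0 &\ge s \cdot \bar{y}_t
\end{aligned} \tag{38}$$

for all $s \in \mathcal{S}$, where the last line follows because we can make $\alpha$ arbitrarily large.

Now, $\mathcal{S}$ was defined as $\mathcal{Y}^\perp$, or equivalently $\bar{\mathcal{Y}}^\perp$. $\bar{\mathcal{Y}}$ is a closed convex cone, since $\mathcal{Y}$ is
closed and convex; so, saying $\mathcal{S} = \bar{\mathcal{Y}}^\perp$ is equivalent to saying $\bar{\mathcal{Y}} = \mathcal{S}^\perp$. But, $\mathcal{S}^\perp$ is exactly
the set of vectors $y$ with $s \cdot y \le 0$ for all $s \in \mathcal{S}$; so, inequality (38) shows that $\bar{y}_t \in \bar{\mathcal{Y}}$. □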

Proof (of Theorem 2): To show $F(s) \le 0$ for all $s \in \mathcal{S}$, recall that $s \cdot y \le 0$ for all $s \in \mathcal{S}$
and $y \in \bar{\mathcal{Y}}$. Since $W(y) \ge 0$, that means that both terms inside the supremum in (7) are
nonpositive for all feasible $y$ when $s \in \mathcal{S}$. Since there is at least one feasible $y$, the value
of the supremum must also be nonpositive.

To show equivalence, consider any $y$ which achieves the supremum in (7). Such a $y$
must exist, since $W(y) + I_{\bar{\mathcal{Y}}}(y) - s \cdot y$ is closed, convex, not everywhere infinite, and has
no directions of recession (see [20, Theorem 27.1(d), p. 265]). For this $y$,

$$\begin{aligned}
F(s + \Delta) &= \sup_{y' \in \bar{\mathcal{Y}}}\,((s + \Delta) \cdot y' - W(y')) \\
&\ge (s + \Delta) \cdot y - W(y) \\
&= \Delta \cdot y + (s \cdot y - W(y)) \\
&= \Delta \cdot y + F(s)
\end{aligned}$$

So, $y \in \partial F(s)$, which is what was required. □
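The argument above can be illustrated with a concrete pair. The sketch below assumes the entropy hedging function $W(y) = \ln d + \sum_i y_i \ln y_i$ on the probability simplex, for which it is a standard fact that the maximizer of $s \cdot y - W(y)$ is the softmax of $s$ and the resulting potential is $F(s) = \ln \sum_i e^{s_i} - \ln d$. The code checks that this maximizer attains the supremum and satisfies the subgradient inequality $F(s + \Delta) \ge F(s) + \Delta \cdot y$. All function names are hypothetical.

```python
import math
import random


def softmax(s):
    """Maximizer of s . y - W(y) over the simplex for the entropy hedging function."""
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


def potential(s):
    """F(s) = ln sum_i exp(s_i) - ln d, computed stably."""
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s)) - math.log(len(s))


def hedging(y):
    """W(y) = ln d + sum_i y_i ln y_i on the simplex (0 ln 0 taken as 0)."""
    return math.log(len(y)) + sum(v * math.log(v) for v in y if v > 0)


rng = random.Random(2)
dim = 4
slack = []    # F(s + delta) - F(s) - delta . y, should be >= 0
attain = []   # s . y - W(y) - F(s), should be ~ 0
for _ in range(200):
    s = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
    delta = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
    y = softmax(s)
    shifted = [a + b for a, b in zip(s, delta)]
    slack.append(potential(shifted) - potential(s)
                 - sum(a * b for a, b in zip(delta, y)))
    attain.append(sum(a * b for a, b in zip(s, y)) - hedging(y) - potential(s))
```

The `attain` values vanish up to rounding (the softmax achieves the supremum), and the `slack` values are nonnegative (the maximizer is a subgradient of $F$ at $s$).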

D  Analysis  of  the  entropy  function 

This  section  derives  the  constants  required  for  using  the  entropy  function  in  the  bounds 
of  Theorems  3  and  4. 

Lemma 10 If $\mathcal{Y}$ is the $d$-dimensional probability simplex and

$$W(y) = \ln d + \sum_i y_i \ln y_i + I_{\mathcal{Y}}(y)$$

then Equation (13) holds using the norm $\|\cdot\|_1$ and $C = 1/2$. And, Equation (14) holds
with $A = \ln d$, $B = 1$, $p = 1$, and $q = \infty$.

PROOF: We will verify Equation (13) first. Write


W0{y)  =  In  d  +  ^2  yi  In  y{ 

i 

We  have  Wq  =  W  inside  y .  and  since  Wq  is  differentiable  in  all  of  ®++  it  will  be  easier  to 
work  with.  Pick  a  hypothesis  y  G  rel  intT  =  dorn dW  and  a  direction  A  with  ||A||i  =  1. 
Define 

WyA{\)  =  Wo{y  +  XA) 

Assume  without  loss  of  generality  that  Xi  ^  =  0  (since  Equation  (13)  holds  trivially  for 
y  +  AA  if  X*  Aj  y  0).  Now,  Equation  (13)  with  C  =  1/2,  evaluated  at  hypothesis  y  and 
increment  A  A,  becomes 

W(y  +  AA)  >W(y)  +  XA-s  +  A2/2  Vs  G  dW(y) 

Since  Xi^i  =  0,  we  may  without  loss  of  generality  take  s  =  W^(y).  That  means  that, 
since  W'y  a(A)  =  A  •  W'{){y  +  AA),  we  need  to  show 

Wy, a(A)  >  WyA{ 0)  +  AW',a(0)  +  A2/2  (39) 

Equation  (39)  holds  if  W"a(A)  >  1  for  all  A  such  that  y  +  AA  G  rel  int  y.  To  check  this 
condition,  we  can  calculate  derivatives  of  WyA  with  respect  to  A.  The  first  derivative  is 

A)  =  Aj(l  +  In  (yi  +  AA*)) 

i 

The  second  derivative  is 

J^IWa(A)  =  £>?/(!,,  + AA,) 
i 

or,  writing  x  =  y  +  AA, 

^WyA(X)  =  J2^/xi  (40) 

i 

We want to verify that the second derivative is always at least 1, so we will find the
$x \in \mathcal{Y}$ which makes (40) as small as possible. Since (40) is a convex function of $x$ which
approaches $\infty$ as any component of $x$ approaches 0,$^{10}$ the second derivative is smallest
when the gradient of (40) with respect to $x$ is orthogonal to the constraint $\sum_i x_i = 1$.
This happens when there is some constant $k > 0$ such that

$$\Delta_i^2 / x_i^2 = k \quad \forall i$$

or equivalently $x_i = |\Delta_i| / \sqrt{k}$. Since $\sum_i |\Delta_i| = 1$ and $\sum_i x_i = 1$, we have $k = 1$ and
$x_i = |\Delta_i|$. Substituting back into (40), that means

$$\frac{d^2}{d\lambda^2} W_{y\Delta}(\lambda) \geq \sum_i \Delta_i^2 / |\Delta_i| = \sum_i |\Delta_i| = 1$$

for all $\lambda$. Since $y$ and $\Delta$ were arbitrary, we have now verified that $W$ satisfies (13).

$^{10}$Unless $\Delta_i = 0$ for some $i$, in which case we can fix $x_i = 0$ and apply the rest of our argument to the
remaining components of $x$. To see why, consider any $j$ such that $\Delta_j^2 > 0$. To reduce (40) we want to
make $x_j$ as large as possible. If $x_i$ were positive, we could increase $x_j$ by reducing $x_i$; so, $x_i$ cannot be
positive at the minimum.

Figure 9: The function $\bar{W}$. $W - \ln d$ is the negentropy function, shown as the solid curve
extending downward from $(0, 1, 0)$ and $(1, 0, 0)$, while $\bar{W}$ is the shaded surface. A contour
plot of $\bar{W}$ is shown projected on the $xy$ plane. $\bar{W}$ is the greatest convex function which
satisfies the conditions (a) $\bar{W}(0) = 0$ and (b) whenever $W(y)$ is finite, $\bar{W}(y) = W(y) - \ln d$.
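The first part of the proof rests on two facts that are easy to spot-check numerically: the curvature bound $\sum_i \Delta_i^2 / x_i \geq 1$ from (40), and inequality (13) itself with $C = 1/2$ and the norm $\|\cdot\|_1$. The Python sketch below samples random points from the simplex. The function names, the sampling scheme, and the sample sizes are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def curvature(x, delta):
    """Right-hand side of (40): sum_i delta_i^2 / x_i for x with positive entries."""
    return sum(d * d / xi for d, xi in zip(delta, x))

def negentropy(y):
    """sum_i y_i ln y_i; the constant ln d in W_0 cancels in every difference below."""
    return sum(v * math.log(v) for v in y)

def gap_13(y, x):
    """W_0(x) - W_0(y) - (x - y) . grad W_0(y) - ||x - y||_1^2 / 2.

    Inequality (13) with C = 1/2 says this is nonnegative for y and x in the
    relative interior of the simplex.
    """
    grad = [1.0 + math.log(v) for v in y]
    linear = sum((xi - yi) * g for xi, yi, g in zip(x, y, grad))
    norm1 = sum(abs(xi - yi) for xi, yi in zip(x, y))
    return negentropy(x) - negentropy(y) - linear - 0.5 * norm1 ** 2

def random_simplex_point(d, rng):
    """A strictly positive vector whose entries sum to 1 (hypothetical sampler)."""
    w = [rng.random() + 1e-3 for _ in range(d)]
    total = sum(w)
    return [v / total for v in w]

def random_direction(d, rng):
    """A direction with sum_i delta_i = 0 and ||delta||_1 = 1."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    mean = sum(g) / d
    g = [v - mean for v in g]
    norm1 = sum(abs(v) for v in g)
    return [v / norm1 for v in g]

def spot_check(trials=2000, d=5, seed=0):
    """Return the smallest curvature and the smallest gap seen over random samples."""
    rng = random.Random(seed)
    min_curvature = float("inf")
    min_gap = float("inf")
    for _ in range(trials):
        y = random_simplex_point(d, rng)
        x = random_simplex_point(d, rng)
        delta = random_direction(d, rng)
        min_curvature = min(min_curvature, curvature(x, delta))
        min_gap = min(min_gap, gap_13(y, x))
    return min_curvature, min_gap

if __name__ == "__main__":
    print(spot_check())
```

At $x_i = |\Delta_i|$ the curvature equals exactly 1, matching the minimizer found in the proof. The quantity checked by `gap_13` is the Kullback-Leibler divergence between $x$ and $y$ minus $\|x - y\|_1^2 / 2$, so in this setting (13) is Pinsker's inequality.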

For the second part: when $\mathcal{Y}$ is the probability simplex, $\bar{\mathcal{Y}}$ is the positive orthant.
Outside the positive orthant, (14) holds trivially. Within the positive orthant, the left-hand
side of (14) is

$$\operatorname{conv} \min(W - \ln d + I_{\mathcal{Y}}, I_0) = \bar{W}$$

which is plotted in Figure 9. $\bar{W}$ is nonpositive when $\sum_i y_i \leq 1$, while the right-hand side
of (14) is $I_{[-1,1]}(\|y\|_1)$, which is zero when $\sum_i y_i \leq 1$. Both the left-hand and right-hand
sides are infinite when $\sum_i y_i > 1$. □

Figure 10: Convex functions and their duals (adapted from [20,21]).

E  Convex  duality 

This appendix provides some standard notation and results from convex duality which are
used in the rest of the paper. For more information on convex duality, Rockafellar's textbook
[20] is a good resource; an introduction with a focus on optimization is in Chapters
2-5 of Boyd and Vandenberghe's textbook [19].

A  set  of  points  is  called  convex  if  it  contains  all  weighted  averages  of  its  elements,  and 
it  is  called  closed  if  it  contains  all  limits  of  sequences  of  its  elements.  Given  a  function 
F(x),  define  the  set 

$$\operatorname{epi}(F) = \{(x, z) \mid z \geq F(x)\}$$

which contains the graph of $F$ and the area above that graph. The set $\operatorname{epi}(F)$ is called the
epigraph of $F$. We will say that the function $F$ is convex iff $\operatorname{epi}(F)$ is convex, and closed
iff $\operatorname{epi}(F)$ is closed. $F(x)$ is allowed to be infinite, in which case $\operatorname{epi}(F)$ has no elements of
the form $(x, z)$ for any $z$. The set $\{x \mid F(x) < \infty\}$ is called the domain of $F$, $\operatorname{dom} F$.
Given a function $F(x)$, its convex dual is defined as

$$F^*(y) = \sup_x \, (x \cdot y - F(x)) \tag{41}$$

$F^*$ is guaranteed to be closed and convex, and if $F$ is closed and convex then $F^{**} = F$.
Any $y_0$ which satisfies

$$F(x) \geq F(x_0) + (x - x_0) \cdot y_0 \quad \forall x \tag{42}$$

is called a subgradient of $F$ at $x_0$, written $y_0 \in \partial F(x_0)$. A subgradient exists almost
everywhere that $F$ is defined: for example, if $x_0$ is in the relative interior of $\operatorname{dom} F$, then
$\partial F(x_0)$ is nonempty. The subgradients of a differentiable convex function are just its
gradients. For any closed convex function $F$, $y_0 \in \partial F(x_0)$ iff $x_0 \in \partial F^*(y_0)$; that is, the
subgradients of $F$ and $F^*$ are inverses of one another.
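Definition (41), the identity $F^{**} = F$, and the inverse relationship between subgradients can all be illustrated on the real line. The Python sketch below approximates the supremum in (41) by a maximum over a finite grid. The test functions $x^2/2$ and $e^x$, the grid, and every function name are illustrative assumptions; the closed form $y \ln y - y$ for the dual of $e^x$ is a standard result rather than something derived in this appendix.

```python
import math

GRID = [i / 20.0 for i in range(-100, 101)]  # points in [-5, 5], spacing 0.05

def conjugate_on_grid(f, grid=GRID):
    """Approximate (41) by maximizing x*y - f(x) over a finite grid of x values."""
    table = [(x, f(x)) for x in grid]
    def f_star(y):
        return max(x * y - fx for x, fx in table)
    return f_star

def quadratic(x):
    return 0.5 * x * x

# The dual of x^2/2 is y^2/2, and dualizing twice returns the original function.
quadratic_star = conjugate_on_grid(quadratic)
quadratic_star_star = conjugate_on_grid(quadratic_star)

def exp_star(y):
    """Closed-form dual of F(x) = exp(x) for y > 0: y ln y - y."""
    return y * math.log(y) - y

def subgradient_round_trip(x0):
    """Map x0 to y0 = F'(x0) for F = exp, then back through (F*)'(y0) = ln y0."""
    y0 = math.exp(x0)
    return y0, math.log(y0)

def fenchel_young_gap(x0):
    """F(x0) + F*(y0) - x0*y0 for F = exp and y0 = F'(x0); zero when y0 is a subgradient at x0."""
    y0 = math.exp(x0)
    return math.exp(x0) + exp_star(y0) - x0 * y0

if __name__ == "__main__":
    for y in (-2.0, 0.0, 1.5):
        print(y, quadratic_star(y), quadratic_star_star(y), 0.5 * y * y)
    print(subgradient_round_trip(0.7), fenchel_young_gap(0.7))
```

The grid approximation is only meaningful for arguments whose maximizer lies inside the grid, here roughly $|y| \leq 5$; outside that range the maximum is attained at a grid endpoint and underestimates the true supremum.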

Convex duality is related to geometric duality: if $F(x) = I_C(x)$ is the indicator function
of a cone $C$, then the dual of $F$ is $F^*(y) = I_{C^\perp}(y)$, where $C^\perp$ is the dual or polar cone to
$C$. The indicator function $I_C$ of a set $C$ is defined by $I_C(x) = 0$ for $x \in C$ and $I_C(x) = \infty$
for $x \notin C$.
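To see why the dual of a cone's indicator is again an indicator, take $C$ to be the nonnegative orthant, whose polar cone is the nonpositive orthant. By (41), $I_C^*(y) = \sup_{x \in C} x \cdot y$. The Python sketch below evaluates this supremum over a bounded piece of the cone and lets the bound grow. The choice of cone, the truncation radius, and the function names are illustrative assumptions.

```python
from itertools import product

def truncated_support(y, radius):
    """sup of x.y over the box {x : 0 <= x_i <= radius}, a bounded piece of the cone C.

    A linear function on a box attains its supremum at a vertex, so it is
    enough to enumerate the 2^d vertices.
    """
    vertices = product((0.0, radius), repeat=len(y))
    return max(sum(xi * yi for xi, yi in zip(x, y)) for x in vertices)

def in_polar_cone(y):
    """Membership in C-perp = {y : x.y <= 0 for all x in C}, the nonpositive orthant."""
    return all(yi <= 0.0 for yi in y)

if __name__ == "__main__":
    for y in [(-1.0, -2.0), (-1.0, 0.5)]:
        values = [truncated_support(y, r) for r in (1.0, 10.0, 100.0)]
        print(y, in_polar_cone(y), values)
```

For $y$ inside the polar cone the truncated supremum stays at 0 for every radius, while for $y$ outside it grows linearly with the radius, which is the finite-sample picture of $I_{C^\perp}(y)$ being 0 or $\infty$.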

Convex duality is also related to duality of seminorms. Let $\|\cdot\|$ and $\|\cdot\|_\circ$ be dual
seminorms. Let $\phi : \mathbb{R} \mapsto \mathbb{R}$ be a convex function with $\phi(x) = \phi(-x)$, and suppose $\phi$ is
monotone nondecreasing on $[0, \infty)$. Then the two functions

$$\phi(\|x\|) \qquad \phi^*(\|y\|_\circ)$$

are dual to each other. (For a proof of the above result, see [20], particularly p. 110 and
Theorem 15.3.) As an example, the norms $\|x\|_1 = \sum_i |x_i|$ and $\|x\|_\infty = \max_i |x_i|$ are dual
to each other, and we could take $\phi(x) = x^2/2$. In this case, we would have that $\|x\|_1^2/2$
and $\|y\|_\infty^2/2$ are duals.
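The duality between $\|x\|_1^2/2$ and $\|y\|_\infty^2/2$ can be checked through the Fenchel-Young inequality $x \cdot y \leq F(x) + F^*(y)$, which follows directly from (41) and holds with equality at a maximizing $x$. The Python sketch below tests the inequality on random pairs and constructs one maximizer explicitly. The function names and the random sampling are illustrative assumptions.

```python
import math
import random

def norm1(x):
    return sum(abs(v) for v in x)

def norm_inf(y):
    return max(abs(v) for v in y)

def primal(x):
    """F(x) = ||x||_1^2 / 2."""
    return 0.5 * norm1(x) ** 2

def dual(y):
    """Claimed dual F*(y) = ||y||_inf^2 / 2."""
    return 0.5 * norm_inf(y) ** 2

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def fenchel_young_slack(x, y):
    """F(x) + F*(y) - x.y, which is nonnegative if F* really is the dual of F."""
    return primal(x) + dual(y) - dot(x, y)

def maximizer(y):
    """An x attaining the supremum in (41): all mass on one largest-magnitude coordinate of y."""
    j = max(range(len(y)), key=lambda i: abs(y[i]))
    x = [0.0] * len(y)
    x[j] = math.copysign(norm_inf(y), y[j])
    return x

def min_random_slack(trials=2000, d=4, seed=0):
    """Smallest slack over random pairs (x, y); should never be negative."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-3.0, 3.0) for _ in range(d)]
        y = [rng.uniform(-3.0, 3.0) for _ in range(d)]
        smallest = min(smallest, fenchel_young_slack(x, y))
    return smallest

if __name__ == "__main__":
    y = [3.0, -4.0, 1.0]
    print(maximizer(y), fenchel_young_slack(maximizer(y), y), min_random_slack())
```

The maximizer puts all of its weight on a single coordinate, which reflects the general pattern that a 1-norm penalty in the primal corresponds to a max-norm quantity in the dual.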

Figure 10 lists some examples of functions and their duals, including some algebraic
rules for computing duals. In the figure, the notation $F \mathbin{\square} G$ means the infimal convolution
of $F$ and $G$,

$$(F \mathbin{\square} G)(y) = \inf_z \, (F(y - z) + G(z))$$

Infimal  convolution  is  interesting  because  it  is  the  dual  of  addition: 

$$(F + G)^* = F^* \mathbin{\square} G^*$$

As an example, if we take $F(x) = \|x\|_1^2$ and $G(x) = I_C(x)$ for a cone $C$, then $(F + G)^*(y)$
is, up to a constant factor, the squared distance of $y$ from $C^\perp$ using the norm $\|\cdot\|_\infty$.
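A one-dimensional instance makes the distance interpretation concrete. The Python sketch below assumes $F(x) = x^2/2$ (a different scaling from the example in the text, chosen so that $F^*(y) = y^2/2$) and the cone $C = [0, \infty)$, whose polar cone is $(-\infty, 0]$. It computes $(F + G)^*$ and $F^* \mathbin{\square} G^*$ separately on a grid and compares both with half the squared distance from $y$ to the polar cone. All names and the grid are illustrative assumptions.

```python
INF = float("inf")
GRID = [i / 20.0 for i in range(-100, 101)]  # points in [-5, 5], spacing 0.05

def f(x):
    """F(x) = x^2 / 2."""
    return 0.5 * x * x

def g(x):
    """Indicator of the cone C = [0, inf)."""
    return 0.0 if x >= 0.0 else INF

def f_star(y):
    """Dual of F: y^2 / 2."""
    return 0.5 * y * y

def g_star(y):
    """Dual of the indicator of C: the indicator of the polar cone (-inf, 0]."""
    return 0.0 if y <= 0.0 else INF

def conjugate_of_sum(y):
    """(F + G)*(y), with the supremum in (41) taken over the grid."""
    return max(x * y - (f(x) + g(x)) for x in GRID)

def infimal_convolution(y):
    """(F* box G*)(y), with the infimum over z taken over the grid."""
    return min(f_star(y - z) + g_star(z) for z in GRID)

def half_squared_distance_to_polar(y):
    """Half the squared distance from y to (-inf, 0]."""
    return 0.5 * max(y, 0.0) ** 2

if __name__ == "__main__":
    for y in (-2.0, 0.0, 1.5, 3.0):
        print(y, conjugate_of_sum(y), infimal_convolution(y),
              half_squared_distance_to_polar(y))
```

The three columns agree at the sampled points: the infimum over $z$ in the polar cone picks out the nearest point of that cone, and what remains is $F^*$ evaluated at the residual.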

A final useful fact is that the convex duality operator reverses inequalities between
functions: for example, if $F(x) \geq G(x)$ for all $x$, then $F^*(y) \leq G^*(y)$ for all $y$.
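The reversal follows in one line from definition (41). The derivation below is followed by a worked instance on the real line; the particular choice $F(x) = x^2$ and $G(x) = x^2/2$ is an illustrative assumption, not an example from the text.

```latex
F(x) \ge G(x)\ \ \forall x
\;\Longrightarrow\;
x \cdot y - F(x) \le x \cdot y - G(x)\ \ \forall x, y
\;\Longrightarrow\;
F^*(y) = \sup_x \bigl(x \cdot y - F(x)\bigr)
\le \sup_x \bigl(x \cdot y - G(x)\bigr) = G^*(y).

% Worked instance on the real line:
F(x) = x^2 \ge \tfrac{1}{2} x^2 = G(x),
\qquad
F^*(y) = \tfrac{1}{4} y^2 \le \tfrac{1}{2} y^2 = G^*(y).
```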